1. Data Summary

In this report, I seek to determine whether there is a significant difference in income between men and women, and, if so, whether the difference observed varies depending on other factors (e.g., education, marital status, criminal history, drug use, childhood household factors, profession, etc.).

To address this question, I use data collected from the National Longitudinal Survey of Youth, 1979 cohort (NLSY79). “The 1979 Cohort,” according to the project’s webpage, “is a longitudinal study that follows the lives of a sample of American youth born between 1957-64. The cohort originally included 12,686 respondents ages 14-22 when first interviewed in 1979; after two subsamples were dropped, 9,964 respondents remain in the eligible samples” (see https://www.nlsinfo.org/content/cohorts/nlsy79 for more information). To support the analysis that follows, I was provided with a base data set containing just 70 of the tens of thousands of variables included in the original data set. This base data set can be accessed in its original form on the Programming R for Analytics course website (http://www.andrew.cmu.edu/user/achoulde/94842/), along with accompanying files describing each of the variables used.

Graphical and Tabular Summaries of Data

(a) Explore Relationship Between Income, Gender, and Race

gender      mean     lower     upper
Male    53445.91  51115.77  55776.05
Female  29538.51  28386.75  30690.27
[Figures: boxplot and violin plot of income by gender. 5662 observations with missing income values were excluded from the plots.]

Mean income, male respondents:      53446
Mean income, female respondents:    29539
Difference (male - female):         23907.4
Difference as % of male mean:       44.73

The table above shows a substantial difference in income between male and female respondents in the NLSY79 cohort. Male respondents reported an average income of $53,445.91, while female respondents reported an average of only $29,538.51. The difference of $23,907.40 amounts to 44.73% of the male average; put another way, female respondents earned on average 44.73% less than their male counterparts.
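The arithmetic behind these figures can be reproduced from the two reported group means. The short base R sketch below is illustrative only (the variable names are my own, not taken from the original analysis code); it shows that the 44.73% figure corresponds to dividing the gap by the male mean, and that dividing by the female mean gives a different, larger percentage.

```r
# Group means as reported in the summary table above.
mean_male   <- 53445.91
mean_female <- 29538.51

# Absolute gap in dollars.
gap <- mean_male - mean_female

# The same gap expressed against two different baselines.
gap_pct_of_male   <- 100 * gap / mean_male    # how much less women earn, relative to men
gap_pct_of_female <- 100 * gap / mean_female  # how much more men earn, relative to women

round(c(gap = gap,
        pct_of_male = gap_pct_of_male,
        pct_of_female = gap_pct_of_female), 2)
```

Relative to the female mean, the same dollar gap comes to roughly 81%, so it matters which baseline is named when the percentage is quoted.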

The difference in the distribution of incomes between male and female respondents is seen even more clearly in the boxplot above. This graph shows that men have not only a higher median income than women but also greater variability in income, especially in the higher income ranges (the 3rd and 4th quartiles).

The violin plot shows more clearly what proportion of men and women fall into each portion of the income range. While both men and women have their highest concentrations at the extreme low end of the distribution, a much higher proportion of men than women fall above $30,000; women are densely clustered around the $25,000 mark.

## # A tibble: 3 x 3
##   race     count income
##   <fct>    <int>  <dbl>
## 1 Other     7510 50839.
## 2 Black     3174 28325.
## 3 Hispanic  2002 36554.

##           
##             Male Female
##   Other    59.19  59.21
##   Black    25.19  24.84
##   Hispanic 15.62  15.95

The graph above shows that average income varies substantially by race. Specifically, it shows that Other (non-Black, non-Hispanic) respondents earned an average income of $50,838.84, compared to $36,554.36 for Hispanic respondents and about $28,325 for Black respondents. If a disproportionate number of female respondents were also women of color, then race may be acting as a confounder in our estimates of the effect of gender on income. However, when we calculate the proportions of male and female respondents per race category, we see roughly equal proportions across all races. This makes it unlikely that race is driving the differences we observe in income between male and female respondents.
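The confounding check described above amounts to a column-percentage cross-tabulation of race by gender. The sketch below shows one way this might be done in base R. The data frame name `nlsy` and its toy rows are assumptions made for illustration; only the column names `gender` and `race` come from the output shown in this report.

```r
# Hypothetical stand-in for the report's data: one row per respondent.
nlsy <- data.frame(
  gender = c("Male", "Male", "Male", "Male",
             "Female", "Female", "Female", "Female"),
  race   = c("Other", "Other", "Black", "Hispanic",
             "Other", "Other", "Black", "Hispanic")
)

# Column percentages: within each gender, the share belonging to each race.
# margin = 2 normalizes each column (gender) to sum to 1.
race_by_gender <- round(100 * prop.table(table(nlsy$race, nlsy$gender),
                                         margin = 2), 2)
race_by_gender
```

If the two gender columns are nearly identical, as they are in the proportions table above, the racial composition of the male and female samples is balanced, and race is unlikely to account for the raw gender gap.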

The next analysis examines whether the wage gap observed between men and women also varies by race.

The graph above shows that the wage gap between men and women holds across all races measured. In each case, the difference is statistically significant, as indicated by the fact that the error bars do not cross the 0 line. Moreover, the wage gap is greatest among non-black, non-hispanic respondents (Other), followed by Hispanic, and finally Black.
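The report does not show how the error bars were computed. One common approach, offered here only as an assumption, is a Welch two-sample t-test within each race category, with the 95% confidence interval for the male-minus-female difference serving as the error bar. The sketch uses simulated incomes, not NLSY79 values.

```r
set.seed(42)  # make the simulated example reproducible

# Simulated data for illustration only: 100 respondents per gender-race cell.
toy <- expand.grid(
  id     = seq_len(100),
  gender = c("Male", "Female"),
  race   = c("Other", "Black", "Hispanic"),
  stringsAsFactors = FALSE
)
# Put "Male" first so that differences are computed as male minus female.
toy$gender <- factor(toy$gender, levels = c("Male", "Female"))
toy$income <- rnorm(nrow(toy),
                    mean = ifelse(toy$gender == "Male", 50000, 30000),
                    sd   = 15000)

# Gap and 95% confidence interval for one subset of the data.
gap_ci <- function(d) {
  fit <- t.test(income ~ gender, data = d)  # Welch test by default
  c(gap   = unname(fit$estimate[1] - fit$estimate[2]),
    lower = fit$conf.int[1],
    upper = fit$conf.int[2])
}

# One row per race category.
ci_by_race <- t(sapply(split(toy, toy$race), gap_ci))

# An interval that excludes 0 corresponds to error bars that do not cross 0.
excludes_zero <- ci_by_race[, "lower"] > 0 | ci_by_race[, "upper"] < 0
ci_by_race
excludes_zero
```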

The proportions table shows that men and women are represented about equally within each of the race categories. This suggests that race probably isn’t driving the wage disparity we observe between men and women; that is, women do not merely appear to earn less than men because they are more strongly represented in the disadvantaged race categories. However, the fact that the wage disparity is greater in some race categories than in others indicates that a woman’s race does influence the magnitude of the income disparity she is likely to experience relative to her male counterparts.

The next set of tables and graphs explores whether the wage gap between men and women might be due, in part, to differences in professional qualifications and/or occupational choices. First, I look at whether male and female respondents are systematically choosing to work in different industries, and whether these choices may help to explain the difference in income between them.

(b) Evaluate Alternative Hypothesis 1

industry                                                               count     income
Health Care and Social Assistance                                        994  38733.419
Agriculture, Forestry, Fishing, and Hunting                               76  32112.960
Mining                                                                    37  70624.444
Utilities                                                                 75  77828.750
Construction                                                             493  33510.458
Manufacturing                                                            774  51756.430
Wholesale Trade                                                          178  58846.483
Retail Trade                                                             566  31745.682
Transportation and Warehousing                                           392  49442.971
Information                                                              140  64085.781
Finance and Insurance                                                    270  74106.454
Real Estate and Rental and Leasing                                       115  41470.730
Professional, Scientific, and Technical Services                         304  81616.629
Management, Administrative and Support, and Waste Management Services    402  26233.206
Educational Services                                                     625  41214.819
Arts, Entertainment, and Recreation                                      106  32663.576
Accomodations and Food Services                                          308  23457.610
Other Services (Except Public Administration                             333  29123.608
Public Administration and Active Duty Military                           441  53055.386
Armed Forces                                                              12  43090.909
Not in Labor Force                                                         1      0.000
Uncodeable                                                                37  37223.486
NA                                                                      6007   9680.487

gender industry income
Male Health Care and Social Assistance 72823.775
Male Agriculture, Forestry, Fishing, and Hunting 36664.117
Male Mining 72252.290
Male Utilities 83860.179
Male Construction 33772.281
Male Manufacturing 60809.464
Male Wholesale Trade 68140.036
Male Retail Trade 47518.280
Male Transportation and Warehousing 53169.996
Male Information 85020.792
Male Finance and Insurance 134255.172
Male Real Estate and Rental and Leasing 40185.222
Male Professional, Scientific, and Technical Services 115337.945
Male Management, Administrative and Support, and Waste Management Services 29635.603
Male Educational Services 55653.855
Male Arts, Entertainment, and Recreation 36357.629
Male Accomodations and Food Services 35760.416
Male Other Services (Except Public Administration 41605.676
Male Public Administration and Active Duty Military 64865.785
Male Armed Forces 41500.000
Male Uncodeable 33815.368
Male NA 15467.761
Female Health Care and Social Assistance 31143.160
Female Agriculture, Forestry, Fishing, and Hunting 13908.333
Female Mining 60531.800
Female Utilities 56718.750
Female Construction 31040.217
Female Manufacturing 32504.409
Female Wholesale Trade 41498.517
Female Retail Trade 19137.980
Female Transportation and Warehousing 40951.835
Female Information 37219.183
Female Finance and Insurance 44203.949
Female Real Estate and Rental and Leasing 43157.958
Female Professional, Scientific, and Technical Services 48800.584
Female Management, Administrative and Support, and Waste Management Services 21206.439
Female Educational Services 36623.458
Female Arts, Entertainment, and Recreation 26473.541
Female Accomodations and Food Services 14137.303
Female Other Services (Except Public Administration 18320.415
Female Public Administration and Active Duty Military 43747.191
Female Armed Forces 47333.333
Female Not in Labor Force 0.000
Female Uncodeable 41270.625
Female NA 5957.862

The first bar chart above shows that the highest paying industries, on average, include Professional, Scientific, and Technical Services; Utilities; Finance and Insurance; and Information, while the lower paying industries include Management, Administrative and Support, and Waste Management Services; Construction; and Accommodations and Food Services.

A side-by-side comparison of average income across industries by gender shows a slight-to-substantial advantage for men in most industries. This suggests that the wage gap between men and women is not due to differences in choice of industry, insofar as men tend to outearn women regardless of the industry they work in. The few exceptions are Real Estate and Rental and Leasing and the Armed Forces, where women slightly outearn men on average (the latter based on only 12 respondents).

Even in those areas where women are more strongly represented, such as Health Care and Social Assistance and Educational Services, men still tend to earn more on average (see next section’s analysis).

Male Female
Health Care and Social Assistance 5.46 23.98
Agriculture, Forestry, Fishing, and Hunting 1.86 0.44
Mining 0.98 0.15
Utilities 1.77 0.50
Construction 13.60 1.38
Manufacturing 15.76 7.56
Wholesale Trade 3.57 1.79
Retail Trade 7.59 9.33
Transportation and Warehousing 8.35 3.47
Information 2.38 1.82
Finance and Insurance 2.80 5.24
Real Estate and Rental and Leasing 2.01 1.44
Professional, Scientific, and Technical Services 4.54 4.56
Management, Administrative and Support, and Waste Management Services 7.29 4.80
Educational Services 4.51 14.03
Arts, Entertainment, and Recreation 1.98 1.21
Accomodations and Food Services 3.99 5.21
Other Services (Except Public Administration 4.73 5.24
Public Administration and Active Duty Military 5.95 7.24
Armed Forces 0.27 0.09
Not in Labor Force 0.00 0.03
Uncodeable 0.61 0.50

## # A tibble: 22 x 2
##    Var1                                        prop.gap
##    <fct>                                          <dbl>
##  1 Health Care and Social Assistance            -18.5  
##  2 Agriculture, Forestry, Fishing, and Hunting    1.42 
##  3 Mining                                         0.83 
##  4 Utilities                                      1.27 
##  5 Construction                                  12.2  
##  6 Manufacturing                                  8.2  
##  7 Wholesale Trade                                1.78 
##  8 Retail Trade                                  -1.74 
##  9 Transportation and Warehousing                 4.88 
## 10 Information                                    0.560
## # ... with 12 more rows

The side-by-side bar chart above shows that the representation gap between men and women varies across industries, with women being more strongly represented in such industries as Health Care and Social Assistance and Educational Services and men being more strongly represented in Construction and Manufacturing. If those industries for which men were more strongly represented also tended to correspond to higher salaries on average, then this disparity might partially explain the wage gap we observe between men and women. If there is no such correlation, however, then this would not be a likely explanation for the gap we observe.

Referring back to the previous section’s analysis, we find no such correlation between high-paying industries and male representation. Men predominate in two of the three low-paying industries noted above (Management, Administrative and Support, and Waste Management Services, and Construction), while women predominate in Finance and Insurance, one of the highest paying industries, are equally represented in Professional, Scientific, and Technical Services, the highest paying industry overall, and are underrepresented by less than 1.5 percentage points in Information and Utilities. While this analysis doesn’t eliminate the possibility that people’s choice of industry contributes to the wage gap we observe between men and women, it does somewhat weaken the case in favor of that explanation. More clarity could be gained into this relationship by performing a more granular analysis of income by profession using the occupation variable from our base data set. I will not pursue that analysis in this report, however.
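One way to put a number on the claim of "no such correlation" is a rank correlation between each industry's average income and its representation gap (male share minus female share). The sketch below is not part of the original analysis. It copies the 20 named industries from the tables above (incomes rounded to whole dollars, in the same row order, Health Care through Armed Forces) and uses Spearman's rho, my own choice of measure because the incomes are skewed.

```r
# Values transcribed from the industry tables above, same row order.
industry <- data.frame(
  income     = c(38733, 32113, 70624, 77829, 33510, 51756, 58846, 31746, 49443,
                 64086, 74106, 41471, 81617, 26233, 41215, 32664, 23458, 29124,
                 53055, 43091),
  male_pct   = c(5.46, 1.86, 0.98, 1.77, 13.60, 15.76, 3.57, 7.59, 8.35, 2.38,
                 2.80, 2.01, 4.54, 7.29, 4.51, 1.98, 3.99, 4.73, 5.95, 0.27),
  female_pct = c(23.98, 0.44, 0.15, 0.50, 1.38, 7.56, 1.79, 9.33, 3.47, 1.82,
                 5.24, 1.44, 4.56, 4.80, 14.03, 1.21, 5.21, 5.24, 7.24, 0.09)
)

# Positive values mean men are over-represented in the industry.
industry$prop_gap <- industry$male_pct - industry$female_pct

# Rank correlation between pay level and male over-representation.
rho <- cor(industry$income, industry$prop_gap, method = "spearman")
rho
```

A value near zero would agree with the reading above, while a clearly positive value would suggest that men cluster in better-paid industries. With only 20 industries, and averages that ignore industry size, the result should be treated as descriptive rather than conclusive.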

In the final section of my analysis of the relationship between income and industry, I compare the wage gap within each industry to further validate the results of the previous sections’ analyses.

industry                                                                     male     female  income.gap
Health Care and Social Assistance                                        72823.77  31143.160   41680.615
Agriculture, Forestry, Fishing, and Hunting                              36664.12  13908.333   22755.783
Mining                                                                   72252.29  60531.800   11720.490
Utilities                                                                83860.18  56718.750   27141.429
Construction                                                             33772.28  31040.217    2732.064
Manufacturing                                                            60809.46  32504.409   28305.055
Wholesale Trade                                                          68140.04  41498.517   26641.519
Retail Trade                                                             47518.28  19137.980   28380.300
Transportation and Warehousing                                           53170.00  40951.835   12218.161
Information                                                              85020.79  37219.183   47801.609
Finance and Insurance                                                   134255.17  44203.949   90051.224
Real Estate and Rental and Leasing                                       40185.22  43157.958   -2972.736
Professional, Scientific, and Technical Services                        115337.94  48800.584   66537.361
Management, Administrative and Support, and Waste Management Services    29635.60  21206.439    8429.164
Educational Services                                                     55653.86  36623.458   19030.397
Arts, Entertainment, and Recreation                                      36357.63  26473.541    9884.088
Accomodations and Food Services                                          35760.42  14137.303   21623.113
Other Services (Except Public Administration                             41605.68  18320.415   23285.260
Public Administration and Active Duty Military                           64865.78  43747.191   21118.594
Armed Forces                                                             41500.00  47333.333   -5833.333
Not in Labor Force                                                            NaN      0.000         NaN
Uncodeable                                                               33815.37  41270.625   -7455.257
NA                                                                       15467.76   5957.862    9509.898

The bar chart above shows that the wage gap between men and women is far more pronounced in certain industries than in others, with most of these disparities favoring men. The widest disparities in earnings are in Finance and Insurance, Professional, Scientific, and Technical Services, and Information, while the smallest are in Construction, Real Estate and Rental and Leasing, and the Armed Forces, with the latter two favoring women. In other words, in the industries in which women have an advantage, that advantage tends to be very modest, whereas the advantage is much larger in the industries that favor men. This analysis provides further evidence that occupational choice is likely not driving the difference we observe in the wages of men and women, although the variation in the size of the gap across industries is an important qualifier.
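A within-industry gap table like the one above can be built by averaging income in each industry-by-gender cell and subtracting the female mean from the male mean. The base R sketch below uses a small made-up data frame. The column names `industry`, `gender`, and `income` match the output in this report, but the rows and the name `nlsy` are hypothetical.

```r
# Hypothetical respondent-level rows for illustration.
nlsy <- data.frame(
  industry = c("Construction", "Construction", "Construction",
               "Mining", "Mining", "Mining"),
  gender   = c("Male", "Female", "Male", "Male", "Female", "Female"),
  income   = c(34000, 31000, 36000, 72000, NA, 60000)
)

# Mean income per industry-by-gender cell, ignoring missing incomes.
means <- tapply(nlsy$income, list(nlsy$industry, nlsy$gender),
                mean, na.rm = TRUE)

# Reshape into one row per industry with a male-minus-female gap column.
gap <- data.frame(
  industry   = rownames(means),
  male       = means[, "Male"],
  female     = means[, "Female"],
  income.gap = means[, "Male"] - means[, "Female"],
  row.names  = NULL
)
gap
```

A positive `income.gap` favors men and a negative one favors women. An industry with no respondents of one gender has no mean for that cell, so its gap is undefined, as in the Not in Labor Force row above.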

Next, let’s look at the relationship between income and educational attainment (highest_grade).

[Figure: boxplot of income by highest grade completed. 5662 observations with missing income values were excluded.]
gender highest_grade count income
Male 12th grade 1644 35593.851
Male 3rd grade 2 34000.000
Male 4th grade 4 15175.000
Male 5th grade 3 18666.667
Male 6th grade 11 21181.818
Male 7th grade 17 10381.250
Male 8th grade 62 15947.492
Male 9th grade 98 21481.411
Male 10th grade 83 15726.840
Male 11th grade 111 17978.963
Male 1st year college 271 50073.200
Male 2nd year college 323 52006.051
Male 3rd year college 149 60522.845
Male 4th year college 410 99372.714
Male 5th year college 81 87994.613
Male 6th year college 119 126561.248
Male 7th year college 46 124460.444
Male 8th year college or more 90 165950.531
Male NA 2879 NaN
Female 12th grade 1534 20820.893
Female None 2 0.000
Female 3rd grade 7 9257.143
Female 4th grade 2 19000.000
Female 5th grade 2 0.000
Female 6th grade 21 4263.810
Female 7th grade 22 5409.091
Female 8th grade 40 4034.595
Female 9th grade 68 7421.875
Female 10th grade 73 6521.145
Female 11th grade 78 10127.158
Female 1st year college 367 27772.804
Female 2nd year college 428 30467.777
Female 3rd year college 232 28718.323
Female 4th year college 459 47055.389
Female 5th year college 126 47855.312
Female 6th year college 173 53895.641
Female 7th year college 77 70402.479
Female 8th year college or more 66 72784.015
Female NA 2506 NaN

In the boxplot provided above, you can see the generally positive effect of additional years of education on income. The boxplot also shows how additional years of education influence the range of incomes accessible to people at each level. Many of the higher income ranges (e.g., above $100k/year) are reserved almost exclusively for those who completed at least 12 years of education (i.e., a high school diploma or more). Around the $350k mark, you can see the top-coded values of those earning significantly more than the bulk of the distribution at each level. We can therefore assume that the real average for the higher educational levels is somewhat higher than what is displayed, although such outcomes are rare.

The bar chart similarly shows a positive association between level of educational attainment and average income. Average income tends to increase with every additional level of educational attainment, for both men and women. There are a few minor deviations from this trend among the grade school and college levels, but some of these differences likely fall within the margin of error for those measurements and so should not be interpreted as significant. Between the major classes of educational attainment, e.g., from grade school to a bachelor’s degree, and between different levels of higher education, the differences are much larger.

Notably, the positive effect of educational attainment on income is much more pronounced for men than for women, a pattern that holds across virtually every category of education. The sole exception is those with a 4th grade education, though again, this difference is likely within the margin of error for this category (n = 6) and therefore should not be interpreted as significant.
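To illustrate why small categories carry a large margin of error, the sketch below computes an approximate 95% margin of error for a group mean at two sample sizes taken from the tables in this report: n = 6 (4th grade) and n = 3178 (12th grade). The standard deviation of 30,000 is a made-up placeholder, since the report does not give within-group standard deviations, and the helper `margin_of_error` is my own.

```r
# Approximate margin of error for a sample mean: t quantile * sd / sqrt(n).
margin_of_error <- function(sd, n, level = 0.95) {
  stopifnot(n >= 2)
  qt(1 - (1 - level) / 2, df = n - 1) * sd / sqrt(n)
}

assumed_sd <- 30000  # hypothetical within-group standard deviation

moe_small <- margin_of_error(assumed_sd, n = 6)     # 4th grade category
moe_large <- margin_of_error(assumed_sd, n = 3178)  # 12th grade category

round(c(n_6 = moe_small, n_3178 = moe_large))
```

Under this assumption, the margin of error for the six-respondent category is tens of thousands of dollars, far larger than the $3,825 difference observed there, while the margin for the 12th grade category is on the order of a thousand dollars.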

Male Female
12th grade 46.65 40.61
None 0.00 0.05
3rd grade 0.06 0.19
4th grade 0.11 0.05
5th grade 0.09 0.05
6th grade 0.31 0.56
7th grade 0.48 0.58
8th grade 1.76 1.06
9th grade 2.78 1.80
10th grade 2.36 1.93
11th grade 3.15 2.07
1st year college 7.69 9.72
2nd year college 9.17 11.33
3rd year college 4.23 6.14
4th year college 11.63 12.15
5th year college 2.30 3.34
6th year college 3.38 4.58
7th year college 1.31 2.04
8th year college or more 2.55 1.75

## # A tibble: 19 x 2
##    Var1                     prop.gap
##    <fct>                       <dbl>
##  1 12th grade                 6.04  
##  2 None                      -0.05  
##  3 3rd grade                 -0.13  
##  4 4th grade                  0.06  
##  5 5th grade                  0.0400
##  6 6th grade                 -0.25  
##  7 7th grade                 -0.100 
##  8 8th grade                  0.7   
##  9 9th grade                  0.980 
## 10 10th grade                 0.430 
## 11 11th grade                 1.08  
## 12 1st year college          -2.03  
## 13 2nd year college          -2.16  
## 14 3rd year college          -1.91  
## 15 4th year college          -0.520 
## 16 5th year college          -1.04  
## 17 6th year college          -1.2   
## 18 7th year college          -0.73  
## 19 8th year college or more   0.800

The two bar charts above show, respectively, the male and female proportional representation at each level of educational attainment and the male-minus-female difference in proportional representation at each level. While male and female respondents were represented nearly equally across all categories, the slight differences that do exist are telling. Specifically, men are more strongly represented among those who completed some or all of high school (9th to 12th grade) and among those with 8 or more years of college, while women are more strongly represented among those who completed 1 to 7 years of college. In other words, female respondents were on average better educated than male respondents across the sample.
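The claim that women are better educated on average can be checked by summing the college-level shares in the proportions table above. The snippet below is a supplementary calculation, not part of the original analysis; it copies the eight college rows (1st year through 8th year or more) for each gender.

```r
# Percentage of each gender at each college level, from the table above
# (1st year college through 8th year college or more).
male_college   <- c(7.69, 9.17, 4.23, 11.63, 2.30, 3.38, 1.31, 2.55)
female_college <- c(9.72, 11.33, 6.14, 12.15, 3.34, 4.58, 2.04, 1.75)

# Share of each gender with at least one year of college.
share_college <- c(Male = sum(male_college), Female = sum(female_college))
share_college
```

By this measure, about 51% of female respondents completed at least one year of college, compared with about 42% of male respondents.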

This analysis, like the previous one, provides evidence against the first alternative hypothesis, which proposed that women may be earning lower incomes because of lower educational attainment compared to men. In fact, what this analysis shows is that men are earning more despite having lower educational qualifications than their female counterparts, which is precisely the opposite of what this hypothesis predicted.

In the final section of my analysis of the relationship between income and professional qualifications, I compare the wage gap within each level of educational attainment to further validate the results of the previous sections’ analyses.

highest_grade count male female income.gap
12th grade 3178 35593.85 20820.893 14772.958
None 2 NaN 0.000 NaN
3rd grade 9 34000.00 9257.143 24742.857
4th grade 6 15175.00 19000.000 -3825.000
5th grade 5 18666.67 0.000 18666.667
6th grade 32 21181.82 4263.810 16918.009
7th grade 39 10381.25 5409.091 4972.159
8th grade 102 15947.49 4034.595 11912.897
9th grade 166 21481.41 7421.875 14059.536
10th grade 156 15726.84 6521.145 9205.695
11th grade 189 17978.96 10127.158 7851.805
1st year college 638 50073.20 27772.804 22300.396
2nd year college 751 52006.05 30467.777 21538.275
3rd year college 381 60522.85 28718.323 31804.522
4th year college 869 99372.71 47055.389 52317.325
5th year college 207 87994.61 47855.312 40139.300
6th year college 292 126561.25 53895.641 72665.607
7th year college 123 124460.44 70402.479 54057.965
8th year college or more 156 165950.53 72784.015 93166.515
NA 5385 NaN NaN NaN

The bar chart above shows that the wage gap between men and women, in favor of men, increases as level of educational attainment increases. We see slight drops at irregular intervals, such as at 5 and 7 years of college, which might represent individuals who stopped short of completing a higher-level degree, such as a master’s, doctorate, or professional degree. Alternatively, the drops may simply reflect small sample sizes, and therefore larger margins of error, for these categories.

This analysis provides a first line of evidence that professional qualifications are likely not driving the difference we observe in the wages of men and women, insofar as men are benefiting more on average from the positive relationship between educational attainment and income, despite women having the stronger educational credentials on average. The last factor we’ll consider in our evaluation of the first alternative hypothesis is number of jobs, which is being used here as a proxy for professional experience.

gender jobs_number count income
Male 0 6 0.000
Male 1 29 40194.138
Male 2 64 38122.949
Male 3 80 47883.633
Male 4 135 61320.128
Male 5 185 71927.517
Male 6 183 62999.920
Male 7 190 63784.467
Male 8 207 59528.240
Male 9 228 65235.624
Male 10 229 61215.014
Male 11 213 65989.446
Male 12 193 57901.611
Male 13 194 51173.299
Male 14 170 57591.124
Male 15 120 54903.276
Male 16 157 55533.842
Male 17 134 45091.077
Male 18 115 42553.054
Male 19 93 52805.000
Male 20 104 32300.767
Male 21 74 31556.740
Male 22 65 35930.969
Male 23 44 35292.744
Male 24 48 29415.375
Male 25 35 35546.057
Male 26 37 19235.000
Male 27 27 37642.222
Male 28 27 44917.560
Male 29 26 31633.923
Male 30 21 26405.056
Male 31 17 33667.750
Male 32 12 16510.917
Male 33 9 23966.667
Male 34 15 27831.467
Male 35 7 17585.714
Male 36 6 17666.667
Male 37 2 31500.000
Male 38 6 17166.667
Male 39 3 17933.333
Male 40 3 5066.667
Male 41 3 0.000
Male 42 2 46660.000
Male 45 2 32500.000
Male 46 1 0.000
Male 48 1 53000.000
Male 51 1 0.000
Male 52 1 0.000
Male NA 2879 NaN
Female 0 24 0.000
Female 1 47 8500.000
Female 2 90 23874.831
Female 3 119 21838.922
Female 4 170 28158.497
Female 5 202 26475.857
Female 6 240 33109.662
Female 7 256 27601.757
Female 8 281 29636.307
Female 9 234 35015.523
Female 10 234 32428.815
Female 11 220 30439.252
Female 12 233 30574.814
Female 13 177 29223.560
Female 14 184 37431.529
Female 15 181 32352.938
Female 16 140 30429.701
Female 17 105 28461.265
Female 18 114 30736.964
Female 19 91 31286.793
Female 20 74 25816.958
Female 21 51 25668.438
Female 22 57 24111.255
Female 23 44 28983.568
Female 24 37 40409.611
Female 25 28 16593.192
Female 26 29 28499.750
Female 27 22 41228.227
Female 28 19 28161.111
Female 29 15 31640.000
Female 30 15 22476.333
Female 31 7 39428.571
Female 32 6 9823.667
Female 33 4 6050.000
Female 34 3 21000.000
Female 35 5 20164.000
Female 36 2 11680.000
Female 37 5 15179.000
Female 38 3 20000.000
Female 41 4 34750.000
Female 44 1 55000.000
Female 45 1 0.000
Female 47 2 30194.000
Female 58 1 50000.000
Female NA 2506 NaN

The bar chart above shows that men are again earning higher incomes on average across most of the range of number of jobs held. The precise nature of the relationship between number of jobs and income is not especially clear from this graph, except perhaps in the case of a few exceptional individuals at the highest extreme of the distribution, who appear to have benefited from having held more jobs. What we may be seeing here is just the effect of age, with number of jobs held serving as a proxy for a person’s age rather than necessarily their experience. For respondents who reported holding between 20 and 40 jobs, however, the effect on income appears erratic, and possibly even negative. For individuals who have held a number of jobs close to the sample average, income appears to be more or less stable, suggesting that the effect of this variable on income may be minimal.

Male Female
0 0.17 0.64
1 0.82 1.24
2 1.82 2.38
3 2.27 3.15
4 3.83 4.50
5 5.25 5.35
6 5.19 6.35
7 5.39 6.78
8 5.87 7.44
9 6.47 6.20
10 6.50 6.20
11 6.04 5.82
12 5.48 6.17
13 5.51 4.69
14 4.82 4.87
15 3.41 4.79
16 4.46 3.71
17 3.80 2.78
18 3.26 3.02
19 2.64 2.41
20 2.95 1.96
21 2.10 1.35
22 1.84 1.51
23 1.25 1.16
24 1.36 0.98
25 0.99 0.74
26 1.05 0.77
27 0.77 0.58
28 0.77 0.50
29 0.74 0.40
30 0.60 0.40
31 0.48 0.19
32 0.34 0.16
33 0.26 0.11
34 0.43 0.08
35 0.20 0.13
36 0.17 0.05
37 0.06 0.13
38 0.17 0.08
39 0.09 0.00
40 0.09 0.00
41 0.09 0.11
42 0.06 0.00
44 0.00 0.03
45 0.06 0.03
46 0.03 0.00
47 0.00 0.05
48 0.03 0.00
51 0.03 0.00
52 0.03 0.00
58 0.00 0.03

## # A tibble: 51 x 2
##    Var1  prop.gap
##    <fct>    <dbl>
##  1 0      -0.47  
##  2 1      -0.42  
##  3 2      -0.560 
##  4 3      -0.880 
##  5 4      -0.67  
##  6 5      -0.1000
##  7 6      -1.16  
##  8 7      -1.39  
##  9 8      -1.57  
## 10 9       0.270 
## # ... with 41 more rows

The first bar chart above shows a nearly identical distribution of men and women across the range of number of jobs held, suggesting approximate balance between the genders with respect to this variable. Since neither gender has substantially higher representation in any category along this range, number of jobs is unlikely to explain any difference in income between the genders, regardless of the nature of its relationship to income.

The second bar chart shows a slight disparity in representation of the genders across number of jobs. Because the gap is computed as the male share minus the female share, the negative values at the low end indicate that women have slightly higher representation (by less than 2 percentage points per category) among those who held 8 or fewer jobs, while men have slightly higher representation (by about 1 percentage point or less per category) in most categories of 20 or more jobs. If number of jobs is taken as a proxy for professional experience, as proposed above, men are therefore marginally more represented in the most experienced categories. The differences are small, however, so this offers at most weak support for the first alternative hypothesis.
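The direction of the representation gap can be verified directly from the proportions table above. The snippet below copies the male and female percentages for 0 to 8 jobs and for 20 to 30 jobs and compares them. It is a check on the published figures rather than a reproduction of the original code.

```r
# Percent of each gender by number of jobs held, from the table above.
jobs_low    <- 0:8
male_low    <- c(0.17, 0.82, 1.82, 2.27, 3.83, 5.25, 5.19, 5.39, 5.87)
female_low  <- c(0.64, 1.24, 2.38, 3.15, 4.50, 5.35, 6.35, 6.78, 7.44)

jobs_high   <- 20:30
male_high   <- c(2.95, 2.10, 1.84, 1.25, 1.36, 0.99, 1.05, 0.77, 0.77, 0.74, 0.60)
female_high <- c(1.96, 1.35, 1.51, 1.16, 0.98, 0.74, 0.77, 0.58, 0.50, 0.40, 0.40)

# prop.gap = male share minus female share, as in the tibble above.
gap_low  <- setNames(male_low - female_low, jobs_low)
gap_high <- setNames(male_high - female_high, jobs_high)

# Cumulative share of each gender in the low range (8 or fewer jobs).
share_low <- c(Male = sum(male_low), Female = sum(female_low))

gap_low
gap_high
share_low
```

Every gap in the 0 to 8 range is negative and every gap in the 20 to 30 range is positive. About 38% of women held 8 or fewer jobs, compared with about 31% of men.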

jobs_number count male female income.gap
0 30 0.000 0.000 0.000000
1 76 40194.138 8500.000 31694.137931
2 154 38122.949 23874.831 14248.117827
3 199 47883.633 21838.922 26044.711172
4 305 61320.128 28158.497 33161.630887
5 387 71927.517 26475.857 45451.659903
6 423 62999.920 33109.662 29890.258117
7 446 63784.467 27601.757 36182.710190
8 488 59528.240 29636.307 29891.933182
9 462 65235.624 35015.523 30220.100917
10 463 61215.014 32428.815 28786.198658
11 433 65989.446 30439.252 35550.193697
12 426 57901.611 30574.814 27326.796047
13 371 51173.299 29223.560 21949.739465
14 354 57591.124 37431.529 20159.595488
15 301 54903.276 32352.938 22550.338362
16 297 55533.842 30429.701 25104.140613
17 239 45091.077 28461.265 16629.811617
18 229 42553.054 30736.964 11816.090418
19 184 52805.000 31286.793 21518.206897
20 178 32300.767 25816.958 6483.809244
21 125 31556.740 25668.438 5888.302226
22 122 35930.969 24111.255 11819.714205
23 88 35292.744 28983.568 6309.176004
24 85 29415.375 40409.611 -10994.236111
25 63 35546.057 16593.192 18952.864835
26 66 19235.000 28499.750 -9264.750000
27 49 37642.222 41228.227 -3586.005050
28 46 44917.560 28161.111 16756.448889
29 41 31633.923 31640.000 -6.076923
30 36 26405.056 22476.333 3928.722222
31 24 33667.750 39428.571 -5760.821429
32 18 16510.917 9823.667 6687.250000
33 13 23966.667 6050.000 17916.666667
34 18 27831.467 21000.000 6831.466667
35 12 17585.714 20164.000 -2578.285714
36 8 17666.667 11680.000 5986.666667
37 7 31500.000 15179.000 16321.000000
38 9 17166.667 20000.000 -2833.333333
39 3 17933.333 NaN NaN
40 3 5066.667 NaN NaN
41 7 0.000 34750.000 -34750.000000
42 2 46660.000 NaN NaN
44 1 NaN 55000.000 NaN
45 3 32500.000 0.000 32500.000000
46 1 0.000 NaN NaN
47 2 NaN 30194.000 NaN
48 1 53000.000 NaN NaN
51 1 0.000 NaN NaN
52 1 0.000 NaN NaN
58 1 NaN 50000.000 NaN
NA 5385 NaN NaN NaN

The bar chart above shows a generally persistent trend of men earning higher incomes regardless of number of jobs. The wage gap is largest toward the middle of the distribution (it peaks at 5 jobs and remains large through about 11 jobs), suggesting that number of jobs might not be a particularly significant factor in determining income level. There are large disparities at the extreme high end of the distribution, likely reflecting a small number of exceptional cases. Something interesting may be happening there, but whatever it is, it likely won’t generalize to the broader population.

All in all, our analysis of this variable has not been particularly informative. Consequently, it will be excluded from my final findings.

Having examined the relationship between income and professional qualifications and occupational choices, we’ll now move on to evaluate our second alternative hypothesis, i.e., that the wage gap between men and women is a result of family dynamics. To test this hypothesis, we’ll look at three variables that, taken together, represent the most intuitive sources of possible influence on an individual’s professional decisions and outcomes.

(c) Evaluate Alternative Hypothesis 2

[Figures: boxplot and violin plot of income by marital status. 6177 observations were ignored in the boxplot, and 5662 rows with missing income values were removed from the violin plot.]
gender marital_status count income
Male Never married 889 31129.53
Male Married 2271 69747.32
Male Separated 195 29194.01
Male Divorced 545 40017.00
Male Widowed 17 30776.92
Male NA 2486 35712.16
Female Never married 689 27591.87
Female Married 2395 31809.81
Female Separated 288 22226.29
Female Divorced 689 29151.98
Female Widowed 52 23127.38
Female NA 2170 24445.27

The bar chart above shows the average income per category of marital status, separated by gender. Men earn higher incomes on average than women across all categories of marital status. The difference is most pronounced among married individuals and least pronounced among those who have never been married. The second alternative hypothesis offers a possible explanation for this trend: single individuals, whether male or female, are less likely to be burdened by the responsibilities of parenthood and can therefore devote more energy and attention to their careers, and perhaps even compete for more demanding, higher-paying jobs. In contrast, married individuals are more likely both to have children and to share income with their partners. Both of these factors (i.e., having larger families and sharing income with a spouse) would, according to this hypothesis, lead us to expect a decrease in the wages of women relative to men, as couples shift the burdens of parenthood disproportionately onto one partner to allow the other partner to fill the role of “breadwinner.” The analyses that follow will help evaluate whether the data supports this explanation.

The boxplot similarly shows higher median incomes for married respondents than for any other category, with the upper fence reaching considerably higher than those of the other categories. The violin plot shows how respondents are distributed across the income range: the married, divorced, and widowed categories are the only ones with a somewhat even concentration of respondents in the higher income ranges, while all other categories taper off sharply as income increases.

Male Female
Never married 22.70 16.75
Married 57.98 58.23
Separated 4.98 7.00
Divorced 13.91 16.75
Widowed 0.43 1.26

## # A tibble: 5 x 2
##   Var1          prop.gap
##   <fct>            <dbl>
## 1 Never married     5.95
## 2 Married          -0.25
## 3 Separated        -2.02
## 4 Divorced         -2.84
## 5 Widowed          -0.83

The bar chart above shows that men are more strongly represented among respondents who have never been married, while women predominate in every other category, albeit by only a slight margin (less than 3 percentage points). Our second alternative hypothesis proposed that the wage gap observed between men and women might be partially explained by a larger proportion of women being married relative to men. This analysis provides only very weak evidence for that hypothesis: the share of male respondents who were single (never married) was approximately 6 percentage points higher than the share of female respondents, and women were only very slightly (0.25 percentage points) more likely to be married. Likewise, a larger proportion of female respondents than male respondents were either separated or divorced. Insofar as these categories correlate positively with shared incomes and/or having children, our hypothesis would receive somewhat stronger support.
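The percentage table and prop.gap values above can be reproduced from the non-missing counts in the marital status table. The sketch below is one way to do it in base R; the object names (counts, pct, prop_gap) are my own, and the gap is taken from the rounded percentages, which is what the printed values appear to reflect.

```r
# Non-missing respondent counts by marital status and gender,
# copied from the count column of the marital status table above.
counts <- matrix(
  c(889, 2271, 195, 545, 17,    # Male
    689, 2395, 288, 689, 52),   # Female
  ncol = 2,
  dimnames = list(
    marital_status = c("Never married", "Married", "Separated",
                       "Divorced", "Widowed"),
    gender = c("Male", "Female")
  )
)

# Column percentages: share of each gender falling in each category.
pct <- round(100 * prop.table(counts, margin = 2), 2)

# Gap in percentage points (male share minus female share).
prop_gap <- round(pct[, "Male"] - pct[, "Female"], 2)

print(pct)
print(prop_gap)
```

Because each column is normalized separately, the gaps are differences in percentage points within gender, not differences in raw head counts.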

## Warning: Factor `marital_status` contains implicit NA, consider using
## `forcats::fct_explicit_na`

## # A tibble: 6 x 5
##   marital_status income.gap   upper  lower is.significant
##   <fct>               <dbl>   <dbl>  <dbl>          <dbl>
## 1 Never married       3538.  -1294.  8370.              0
## 2 Married            37938.  34068. 41807.              1
## 3 Separated           6968.    106. 13830.              1
## 4 Divorced           10865.   4702. 17028.              1
## 5 Widowed             7650. -10473. 25772.              0
## 6 <NA>               11267.   3720. 18814.              1

The bar chart above shows that there is a statistically significant difference in the incomes of men and women for the separated, divorced, and married categories, but no statistically significant difference for the never married and widowed categories. The most notable difference, by a wide margin, is in the married category, where the mean difference in income between men and women is $37937.52 in favor of men.

However, since men and women are represented roughly equally within this category (refer to previous section’s analysis), this disparity does not help to explain why men tend to earn higher incomes than women.
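The is.significant flag in the table above is consistent with checking whether a 95% confidence interval for the difference in mean income excludes zero (note that in the printed table the columns labeled upper and lower appear to be swapped, since the "upper" values are the smaller ones). A minimal sketch of that check follows, assuming a Welch two-sample t-test; the report does not state the exact method, and gap_ci and the toy incomes are hypothetical.

```r
# Hypothetical helper: difference in mean income (male minus female)
# with a 95% Welch confidence interval and a 0/1 significance flag.
gap_ci <- function(income, gender) {
  male   <- income[gender == "Male"]
  female <- income[gender == "Female"]
  tt <- t.test(male, female)          # Welch test is the default
  ci <- tt$conf.int
  data.frame(
    income.gap     = mean(male, na.rm = TRUE) - mean(female, na.rm = TRUE),
    lower          = ci[1],
    upper          = ci[2],
    # Flag is 1 when the interval lies entirely on one side of zero.
    is.significant = as.numeric(ci[1] > 0 | ci[2] < 0)
  )
}

# Toy data, not NLSY values.
income <- c(50000, 60000, 70000, 80000, 90000,
            20000, 25000, 30000, 35000, 40000)
gender <- rep(c("Male", "Female"), each = 5)

res <- gap_ci(income, gender)
print(res)
```

Applied once per marital status category, a helper like this would yield one row per category, matching the layout of the table above.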

gender family_size count income
Male 1 985 35378.04
Male 2 985 49467.98
Male 3 687 58505.22
Male 4 548 77328.13
Male 5 230 76211.10
Male 6 57 65517.04
Male 7 19 23473.68
Male 8 9 21111.11
Male 9 2 17500.00
Male 10 1 0.00
Male 11 1 114000.00
Male NA 2879 NaN
Female 1 771 28023.20
Female 2 1235 30690.20
Female 3 853 31567.35
Female 4 545 29494.64
Female 5 232 26978.25
Female 6 86 21318.51
Female 7 32 15062.50
Female 8 10 30720.00
Female 9 5 14800.00
Female 10 2 66170.50
Female 11 2 3500.00
Female 12 3 0.00
Female 16 1 27000.00
Female NA 2506 NaN
## Warning: Removed 2 rows containing missing values (geom_bar).

The bar chart above shows a positive relationship between income and family size for men up to a family size of approximately 4, while for women no such relationship exists. Rather, women’s income appears to be roughly flat up to a family size of 4 and then, like men’s, begins to drop. The two large bars at the high end of the distribution for family size represent just a few outliers and likely do not generalize to the larger population.

Across almost all categories of family size, we again observe an advantage in income for men. This trend is consistent with our second alternative hypothesis, which proposed that as couples decide to start families, the two partners adopt a division-of-labor strategy that allows men to continue advancing in their careers through the family-building process, while women’s careers stagnate. This is not the only interpretation of what we observe here, but it is one possible interpretation.

Male Female
1 27.95 20.41
2 27.95 32.70
3 19.49 22.58
4 15.55 14.43
5 6.53 6.14
6 1.62 2.28
7 0.54 0.85
8 0.26 0.26
9 0.06 0.13
10 0.03 0.05
11 0.03 0.05
12 0.00 0.08
16 0.00 0.03

## # A tibble: 13 x 2
##    Var1  prop.gap
##    <fct>    <dbl>
##  1 1        7.54 
##  2 2       -4.75 
##  3 3       -3.09 
##  4 4        1.12 
##  5 5        0.39 
##  6 6       -0.660
##  7 7       -0.310
##  8 8        0    
##  9 9       -0.07 
## 10 10      -0.02 
## 11 11      -0.02 
## 12 12      -0.08 
## 13 16      -0.03

The bar chart above shows that men are more strongly represented at the low end of the distribution for family size (family size = 1), while women are more strongly represented for family sizes of 2-3. At the higher end of the distribution (family size >= 4), the proportional representation of men and women across the different categories of family size is roughly balanced, with the difference in representation vacillating slightly between men and women up to a family size of 7, and then effectively flattening out for family sizes larger than 7.

This pattern is consistent with our second alternative hypothesis, insofar as it suggests that more men than women in our study had very small families. We might expect having smaller families to provide an advantage to men in terms of income earning potential, since they are able to focus more of their attention on advancing their careers.

None of the analyses of this section provide decisive evidence in favor of our hypothesis, and in general, the evidence it does provide is pretty modest. In the next section, we’ll look at the wage gap between men and women for the different categories of family size to try to gain a bit more clarity on the magnitude of the advantage that family size might provide in terms of an individual’s income earning potential.

family_size count male female income.gap
1 1756 35378.04 28023.20 7354.846
2 2220 49467.98 30690.20 18777.782
3 1540 58505.22 31567.35 26937.878
4 1093 77328.13 29494.64 47833.484
5 462 76211.10 26978.25 49232.842
6 143 65517.04 21318.51 44198.531
7 51 23473.68 15062.50 8411.184
8 19 21111.11 30720.00 -9608.889
9 7 17500.00 14800.00 2700.000
10 3 0.00 66170.50 -66170.500
11 3 114000.00 3500.00 110500.000
12 3 NaN 0.00 NaN
16 1 NaN 27000.00 NaN
NA 5385 NaN NaN NaN
## Warning: Removed 3 rows containing missing values (position_stack).

The bar chart above shows more clearly the magnitude of the wage gap between men and women across different categories of family size. As previously observed, men earn higher incomes on average for all family sizes up to a family size of 7. The tall bars at the upper extreme of the distribution for family size represent outliers and likely do not reflect trends that generalize to the broader population.

Notably, the wage gap between men and women shrinks considerably for larger family sizes (family sizes of 7-9). This pattern may simply be an artifact of having smaller sample sizes for these categories (and therefore larger margins of error), or it may indicate that having a family of this size suppresses wages for both men and women equally, although this explanation seems somewhat unlikely. A more plausible explanation may be that larger families tend to correlate positively with age and professional experience, factors that are associated with larger wages for both men and women.

Referring back to the table, we see that, indeed, sample sizes are much smaller for these categories of family size (increasing our margin of error) and wages on average are lower for both men and women. This analysis is most consistent with the first two explanations above, but largely rules out the third explanation.
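The income.gap column in the table above is the male mean minus the female mean within each family size. A minimal base R sketch of that calculation is shown here, using a small made-up data frame (demo and its values are hypothetical, not NLSY data).

```r
# Toy respondent-level data; values are illustrative only.
demo <- data.frame(
  family_size = c(1, 1, 1, 1, 2, 2, 2, 2, 3),
  gender      = c("Male", "Male", "Female", "Female",
                  "Male", "Male", "Female", "Female", "Female"),
  income      = c(30000, 40000, 25000, 31000,
                  45000, 55000, 28000, 32000, 27000)
)

# Mean income for every family_size x gender cell.
# Cells with no respondents come back as NA.
means <- tapply(demo$income, list(demo$family_size, demo$gender),
                mean, na.rm = TRUE)

gap_table <- data.frame(
  family_size = as.numeric(rownames(means)),
  count       = as.vector(table(demo$family_size)),
  male        = means[, "Male"],
  female      = means[, "Female"],
  income.gap  = means[, "Male"] - means[, "Female"]
)
print(gap_table)
```

Keeping the count column next to the gap makes it easy to see which rows rest on only a handful of respondents, as with the largest family sizes in the table above, before reading anything into them.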

gender count income spouse_income
Male 6403 53445.91 31076.38
Female 6283 29538.51 56380.51
## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 9192 rows containing non-finite values (stat_smooth).
## Warning: Removed 9192 rows containing missing values (geom_point).

The scatter plots above show men’s and women’s income plotted against their spouses’ income. When a smoothed curve is added to represent how income varies with spouse’s income, we see markedly different trends for men and women. For men, personal income appears to be positively correlated with their spouse’s income at the higher ranges of the distribution for spouse’s income, while for women, the relationship is almost flat throughout the full range of the distribution. In other words, women’s income stays about the same on average regardless of how much their spouses earn, while men’s income appears to drop slightly as their spouse’s income increases up to about $30,000, but then increases steadily as their spouse’s income rises above $30,000.

This analysis does not support our second alternative hypothesis, which proposed that women’s income may be lower than men’s in part because they are strategically dividing the caretaking and “breadwinning” responsibilities with their spouses. If that were happening to a significant extent, we would expect women’s wages to decrease slightly on average as their spouse’s income increases, as some women dropped out of the work force to focus on parenting. Instead, what we find is that women’s incomes tend to increase along with their spouse’s income up to about x=$150,000 and decline only when their spouse’s income exceeds $150,000.

This interpretation is slightly more consistent with what we see happening in the case of men, however, at least for lower-income couples - that is, men tend to have lower incomes as their spouse’s income increases up to about x=$40,000, possibly reflecting the effects of strategic income sharing to balance household responsibilities (although this is certainly not the only explanation for what we see). Among higher-income couples, however, both partners’ incomes seem to increase together.
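To make the smoothing step concrete, here is a small sketch that fits a separate smoothed curve of income against spouse income for each gender. It swaps in base R's lowess() for the GAM smoother that geom_smooth() reported using, so the curves would differ in detail from those in the plots; the data frame couples is made up for illustration.

```r
# Toy data, not NLSY values.
couples <- data.frame(
  gender        = rep(c("Male", "Female"), each = 8),
  spouse_income = rep(c(0, 10, 20, 30, 40, 60, 80, 100) * 1000, times = 2),
  income        = c(48, 45, 43, 42, 46, 55, 66, 80,
                    30, 31, 29, 30, 31, 30, 29, 30) * 1000
)

# Keep only rows where both income values are present, mirroring the
# rows dropped in the plotting warnings above.
complete <- couples[complete.cases(couples[, c("income", "spouse_income")]), ]

# One lowess fit per gender; each element holds the smoothed x/y pairs.
smooth_by_gender <- lapply(
  split(complete, complete$gender),
  function(d) lowess(d$spouse_income, d$income, f = 2 / 3)
)

str(smooth_by_gender)
```

Fitting the curves separately by gender is what allows the two trends to differ in shape, rather than forcing a single curve through the pooled data.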

Regression Models

(a) Interpretation of Linear Regression Model

## 
## Call:
## lm(formula = income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income, data = nlsy)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -174383  -27993   -5721   15250  332390 
## 
## Coefficients:
##                                                                                   Estimate
## (Intercept)                                                                    50408.63332
## genderFemale                                                                  -38549.57669
## industryAgriculture, Forestry, Fishing, and Hunting                           -17573.60033
## industryMining                                                                  5533.53835
## industryUtilities                                                              16315.67927
## industryConstruction                                                          -12613.21141
## industryManufacturing                                                           7542.68076
## industryWholesale Trade                                                        16255.21939
## industryRetail Trade                                                           -8288.20265
## industryTransportation and Warehousing                                          1899.73611
## industryInformation                                                            -8530.26504
## industryFinance and Insurance                                                  22446.96423
## industryReal Estate and Rental and Leasing                                       127.48170
## industryProfessional, Scientific, and Technical Services                       14724.82624
## industryManagement, Administrative and Support, and Waste Management Services -13202.61740
## industryEducational Services                                                  -20881.38251
## industryArts, Entertainment, and Recreation                                   -26188.22396
## industryAccomodations and Food Services                                        -4452.80819
## industryOther Services (Except Public Administration                          -17688.01379
## industryPublic Administration and Active Duty Military                         -3262.75178
## industryArmed Forces                                                          -35431.90704
## industryUncodeable                                                               703.97973
## highest_gradeNone                                                             -21140.18910
## highest_grade3rd grade                                                        -19725.39031
## highest_grade4th grade                                                        -38391.20419
## highest_grade5th grade                                                        -28408.94459
## highest_grade6th grade                                                        -22375.67412
## highest_grade7th grade                                                        -21874.45298
## highest_grade8th grade                                                        -19361.64800
## highest_grade9th grade                                                        -15572.74528
## highest_grade10th grade                                                       -15232.66764
## highest_grade11th grade                                                        -8061.94008
## highest_grade1st year college                                                   8905.06696
## highest_grade2nd year college                                                  11454.94653
## highest_grade3rd year college                                                  15517.66234
## highest_grade4th year college                                                  41690.83978
## highest_grade5th year college                                                  41089.50464
## highest_grade6th year college                                                  66760.02367
## highest_grade7th year college                                                  69880.27751
## highest_grade8th year college or more                                          98193.07651
## marital_statusMarried                                                           9007.20721
## marital_statusSeparated                                                        -2014.48258
## marital_statusDivorced                                                          2599.34044
## marital_statusWidowed                                                          -1344.96715
## spouse_income                                                                      0.01957
##                                                                                 Std. Error
## (Intercept)                                                                     5767.07279
## genderFemale                                                                    2327.78847
## industryAgriculture, Forestry, Fishing, and Hunting                             9476.83840
## industryMining                                                                 12557.53168
## industryUtilities                                                               9547.29445
## industryConstruction                                                            5117.06663
## industryManufacturing                                                           4162.62515
## industryWholesale Trade                                                         6678.70338
## industryRetail Trade                                                            4505.12703
## industryTransportation and Warehousing                                          5439.48091
## industryInformation                                                             7101.72268
## industryFinance and Insurance                                                   5657.77316
## industryReal Estate and Rental and Leasing                                      8654.90972
## industryProfessional, Scientific, and Technical Services                        5172.94580
## industryManagement, Administrative and Support, and Waste Management Services   5682.71128
## industryEducational Services                                                    4226.14491
## industryArts, Entertainment, and Recreation                                    10187.84149
## industryAccomodations and Food Services                                         6406.99666
## industryOther Services (Except Public Administration                            5661.71730
## industryPublic Administration and Active Duty Military                          4726.67509
## industryArmed Forces                                                           20311.33298
## industryUncodeable                                                             20298.34744
## highest_gradeNone                                                              56854.98364
## highest_grade3rd grade                                                         28587.54679
## highest_grade4th grade                                                         33086.74481
## highest_grade5th grade                                                         56871.00256
## highest_grade6th grade                                                         19062.35620
## highest_grade7th grade                                                         19077.69985
## highest_grade8th grade                                                         11792.94330
## highest_grade9th grade                                                          9288.12509
## highest_grade10th grade                                                         9540.19675
## highest_grade11th grade                                                         8594.04315
## highest_grade1st year college                                                   3795.95729
## highest_grade2nd year college                                                   3575.15019
## highest_grade3rd year college                                                   4756.14441
## highest_grade4th year college                                                   3160.68107
## highest_grade5th year college                                                   5648.47115
## highest_grade6th year college                                                   4776.04189
## highest_grade7th year college                                                   7129.01326
## highest_grade8th year college or more                                           6256.67862
## marital_statusMarried                                                           4762.31574
## marital_statusSeparated                                                         8109.54455
## marital_statusDivorced                                                          6008.68773
## marital_statusWidowed                                                          14997.38040
## spouse_income                                                                      0.02328
##                                                                               t value
## (Intercept)                                                                     8.741
## genderFemale                                                                  -16.561
## industryAgriculture, Forestry, Fishing, and Hunting                            -1.854
## industryMining                                                                  0.441
## industryUtilities                                                               1.709
## industryConstruction                                                           -2.465
## industryManufacturing                                                           1.812
## industryWholesale Trade                                                         2.434
## industryRetail Trade                                                           -1.840
## industryTransportation and Warehousing                                          0.349
## industryInformation                                                            -1.201
## industryFinance and Insurance                                                   3.967
## industryReal Estate and Rental and Leasing                                      0.015
## industryProfessional, Scientific, and Technical Services                        2.847
## industryManagement, Administrative and Support, and Waste Management Services  -2.323
## industryEducational Services                                                   -4.941
## industryArts, Entertainment, and Recreation                                    -2.571
## industryAccomodations and Food Services                                        -0.695
## industryOther Services (Except Public Administration                           -3.124
## industryPublic Administration and Active Duty Military                         -0.690
## industryArmed Forces                                                           -1.744
## industryUncodeable                                                              0.035
## highest_gradeNone                                                              -0.372
## highest_grade3rd grade                                                         -0.690
## highest_grade4th grade                                                         -1.160
## highest_grade5th grade                                                         -0.500
## highest_grade6th grade                                                         -1.174
## highest_grade7th grade                                                         -1.147
## highest_grade8th grade                                                         -1.642
## highest_grade9th grade                                                         -1.677
## highest_grade10th grade                                                        -1.597
## highest_grade11th grade                                                        -0.938
## highest_grade1st year college                                                   2.346
## highest_grade2nd year college                                                   3.204
## highest_grade3rd year college                                                   3.263
## highest_grade4th year college                                                  13.190
## highest_grade5th year college                                                   7.274
## highest_grade6th year college                                                  13.978
## highest_grade7th year college                                                   9.802
## highest_grade8th year college or more                                          15.694
## marital_statusMarried                                                           1.891
## marital_statusSeparated                                                        -0.248
## marital_statusDivorced                                                          0.433
## marital_statusWidowed                                                          -0.090
## spouse_income                                                                   0.841
##                                                                               Pr(>|t|)
## (Intercept)                                                                    < 2e-16
## genderFemale                                                                   < 2e-16
## industryAgriculture, Forestry, Fishing, and Hunting                            0.06378
## industryMining                                                                 0.65949
## industryUtilities                                                              0.08756
## industryConstruction                                                           0.01376
## industryManufacturing                                                          0.07008
## industryWholesale Trade                                                        0.01499
## industryRetail Trade                                                           0.06590
## industryTransportation and Warehousing                                         0.72693
## industryInformation                                                            0.22978
## industryFinance and Insurance                                                 7.43e-05
## industryReal Estate and Rental and Leasing                                     0.98825
## industryProfessional, Scientific, and Technical Services                       0.00445
## industryManagement, Administrative and Support, and Waste Management Services  0.02023
## industryEducational Services                                                  8.19e-07
## industryArts, Entertainment, and Recreation                                    0.01020
## industryAccomodations and Food Services                                        0.48711
## industryOther Services (Except Public Administration                           0.00180
## industryPublic Administration and Active Duty Military                         0.49007
## industryArmed Forces                                                           0.08118
## industryUncodeable                                                             0.97234
## highest_gradeNone                                                              0.71005
## highest_grade3rd grade                                                         0.49025
## highest_grade4th grade                                                         0.24601
## highest_grade5th grade                                                         0.61744
## highest_grade6th grade                                                         0.24056
## highest_grade7th grade                                                         0.25164
## highest_grade8th grade                                                         0.10073
## highest_grade9th grade                                                         0.09372
## highest_grade10th grade                                                        0.11044
## highest_grade11th grade                                                        0.34827
## highest_grade1st year college                                                  0.01904
## highest_grade2nd year college                                                  0.00137
## highest_grade3rd year college                                                  0.00112
## highest_grade4th year college                                                  < 2e-16
## highest_grade5th year college                                                 4.40e-13
## highest_grade6th year college                                                  < 2e-16
## highest_grade7th year college                                                  < 2e-16
## highest_grade8th year college or more                                          < 2e-16
## marital_statusMarried                                                          0.05867
## marital_statusSeparated                                                        0.80383
## marital_statusDivorced                                                         0.66534
## marital_statusWidowed                                                          0.92855
## spouse_income                                                                  0.40063
##                                                                                  
## (Intercept)                                                                   ***
## genderFemale                                                                  ***
## industryAgriculture, Forestry, Fishing, and Hunting                           .  
## industryMining                                                                   
## industryUtilities                                                             .  
## industryConstruction                                                          *  
## industryManufacturing                                                         .  
## industryWholesale Trade                                                       *  
## industryRetail Trade                                                          .  
## industryTransportation and Warehousing                                           
## industryInformation                                                              
## industryFinance and Insurance                                                 ***
## industryReal Estate and Rental and Leasing                                       
## industryProfessional, Scientific, and Technical Services                      ** 
## industryManagement, Administrative and Support, and Waste Management Services *  
## industryEducational Services                                                  ***
## industryArts, Entertainment, and Recreation                                   *  
## industryAccomodations and Food Services                                          
## industryOther Services (Except Public Administration                          ** 
## industryPublic Administration and Active Duty Military                           
## industryArmed Forces                                                          .  
## industryUncodeable                                                               
## highest_gradeNone                                                                
## highest_grade3rd grade                                                           
## highest_grade4th grade                                                           
## highest_grade5th grade                                                           
## highest_grade6th grade                                                           
## highest_grade7th grade                                                           
## highest_grade8th grade                                                           
## highest_grade9th grade                                                        .  
## highest_grade10th grade                                                          
## highest_grade11th grade                                                          
## highest_grade1st year college                                                 *  
## highest_grade2nd year college                                                 ** 
## highest_grade3rd year college                                                 ** 
## highest_grade4th year college                                                 ***
## highest_grade5th year college                                                 ***
## highest_grade6th year college                                                 ***
## highest_grade7th year college                                                 ***
## highest_grade8th year college or more                                         ***
## marital_statusMarried                                                         .  
## marital_statusSeparated                                                          
## marital_statusDivorced                                                           
## marital_statusWidowed                                                            
## spouse_income                                                                    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 56760 on 3067 degrees of freedom
##   (9574 observations deleted due to missingness)
## Multiple R-squared:  0.2822, Adjusted R-squared:  0.2719 
## F-statistic:  27.4 on 44 and 3067 DF,  p-value: < 2.2e-16

In this section, I fit a linear regression model of income on gender and my chosen key variables and interpret the model coefficients. As noted above, several variables included in my part 1 analysis, including jobs_number and family_size, were excluded from my final model due to weak association with the main variables of interest (gender and income) and/or difficulties with interpretability. It should also be noted that the analysis in this section relies on certain assumptions that will not be evaluated until the next section (part (b)), which will determine whether a linear regression is appropriate for modeling the relationship between income and these variables and, correspondingly, whether the standard interpretation of the coefficients is valid.

The first thing to note from the output summary above is that gender is a highly statistically significant predictor of income, with a p-value of < 2e-16. Even holding industry, educational attainment, marital status, and spouse’s income constant, being female is associated with a -$38,549.58 difference in income compared to being male. Altogether, the coefficient estimates that are statistically significant at the 0.05 level in this model include (Intercept), genderFemale, industryConstruction, industryWholesale Trade, industryFinance and Insurance, industryProfessional, Scientific, and Technical Services, industryManagement, Administrative and Support, and Waste Management Services, industryEducational Services, industryArts, Entertainment, and Recreation, industryOther Services (Except Public Administration, highest_grade1st year college, highest_grade2nd year college, highest_grade3rd year college, highest_grade4th year college, highest_grade5th year college, highest_grade6th year college, highest_grade7th year college, and highest_grade8th year college or more. Below I provide an interpretation of a select few of these significant variables.

For the interpretations that follow, the baseline for comparison is a male who has never been married, has a high school education (has completed 12 years of education), works in the area of health care and social assistance, and has a spouse with an income of $0. This is, of course, merely a hypothetical scenario and doesn’t necessarily (or actually) represent any individual from our sample population. For ease of interpretation, all subsequent mentions of “holding all other variables constant” should be understood to refer to this particular collection of features, save only for the facts that (a) the individual being compared against this baseline is female (and therefore has an expected income $38,549.58 lower than the male baseline) and (b) she differs in the one additional respect specified (i.e., that for which the coefficient is being interpreted).
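To make the role of the baseline concrete, the sketch below uses a small simulated data set (hypothetical values, not the NLSY79 data) to show how R picks a reference level for each factor predictor and how relevel() sets it explicitly. The variable names mirror those in the model; the incomes and effect sizes are invented for illustration.

```r
# Simulated data for illustration only; not the NLSY79 sample.
set.seed(1)
n <- 200
toy <- data.frame(
  gender   = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  industry = factor(sample(c("Educational Services",
                             "Finance and Insurance",
                             "Health Care and Social Assistance"),
                           n, replace = TRUE))
)
toy$income <- 40000 - 10000 * (toy$gender == "Female") +
  15000 * (toy$industry == "Finance and Insurance") + rnorm(n, sd = 5000)

# By default R uses the first factor level (alphabetical order) as the baseline.
# relevel() sets the baseline explicitly.
toy$gender   <- relevel(toy$gender, ref = "Male")
toy$industry <- relevel(toy$industry, ref = "Health Care and Social Assistance")

fit <- lm(income ~ gender + industry, data = toy)

# Each coefficient is a difference from the baseline group
# (here: a male working in Health Care and Social Assistance).
round(coef(fit))
```

Any level that does not appear among the coefficient names is the one absorbed into the intercept, which is how the baseline categories can be read off a regression summary.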

Referring to the output summary above, we see that working in Educational Services is associated with a -$59,430.96 difference in income compared to working in Health Care and Social Assistance (the baseline industry), holding all other variables constant. Working in Finance and Insurance is associated with a -$16,102.61 difference in income compared to the baseline industry, holding all other variables constant. Working in Professional, Scientific, and Technical Services is associated with a -$23,824.75 difference in income compared to the baseline industry, holding all other variables constant.

Similarly, with regard to level of educational attainment, we see that having completed 4 years of college is associated with a $3,141.26 difference in income compared to having completed 12th grade (the baseline level), holding all other variables constant, while having completed 8 or more years of college is associated with a $59,643.50 difference in income compared to the same baseline, holding all other variables constant.

The R-squared for this model is 0.2822, meaning that approximately 28.22% of the variability in income is explained by our predictors.
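As a quick consistency check, the adjusted R-squared and the overall F-statistic can be recovered from the R-squared and the degrees of freedom printed in the summary. The sketch below uses only values copied from that output; the variable names are my own.

```r
# Values copied from the regression summary above.
r2       <- 0.2822   # Multiple R-squared
df_model <- 44       # numerator degrees of freedom (number of coefficients tested)
df_resid <- 3067     # residual degrees of freedom

# Adjusted R-squared penalizes the number of estimated coefficients.
# (df_model + df_resid) equals n - 1 for the observations actually used.
adj_r2 <- 1 - (1 - r2) * (df_model + df_resid) / df_resid

# Overall F-statistic implied by R-squared.
f_stat <- (r2 / df_model) / ((1 - r2) / df_resid)

round(c(adj_r2 = adj_r2, f_stat = f_stat), 4)
```

Up to rounding of the inputs, these should agree with the adjusted R-squared (0.2719) and F-statistic (27.4) reported in the summary.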

Next, I turn to an evaluation of the linear regression model to better assess the validity of the interpretations provided in this section.

(b) Evaluation of Linear Regression Model

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

In this section, I discuss whether the standard diagnostic plots indicate issues with the linear regression model of income on gender and my chosen key variables. The issues I’m looking for include trends in the residuals, non-constant variance, outliers, and so on.

First of all, when we plot the residuals for this model, we see that both the linearity assumption (i.e., that the residuals look like random scatter around the zero line, with no evidence of structure or pattern) and the homoscedasticity assumption (i.e., that the deviations of the y values from the fitted line have equal variance) are violated, meaning the relationship cannot be appropriately modeled by a linear regression. Instead, the data tend to concentrate below where the fitted line would predict them to appear, indicating that most observed values are lower than what we would expect if the relationship were linear.

The assumption of homoscedasticity is somewhat more difficult to assess, but there appear to be slight variations in the spread of y values at certain x values, specifically below x values of about 25,000. There also seems to be significant left-to-right fanning when we focus just on where the data are most densely concentrated, although this fanning is clearly constrained by the rigid bottom limit and, to a somewhat lesser degree, by the upper limit as well (likely a consequence of our topcoded values). In light of this analysis, we can conclude that any linear regression model of this relationship is going to be significantly limited in its predictive power.

The Scale-Location plot provides a more fine-tuned tool for assessing the assumption of homoscedasticity, i.e., equal variance in the deviation of the y values from the fitted line. If this assumption were upheld, the red line running through the points would be approximately flat. However, that’s not what we see in our plot, indicating that we do not have equal variance in the deviation of our y values from the fitted line across all values of x. This analysis further supports our conclusion from above that a linear model of this relationship is not appropriate.

Next, we consider the Q-Q plot. A Q-Q plot takes the sample data, sorts them in ascending order, and plots them against quantiles calculated from a theoretical distribution (https://data.library.virginia.edu/understanding-q-q-plots/). The superimposed line represents where the data would be expected to fall if their underlying distribution were normal. The Q-Q plot above shows that the sample data do not conform to a normal distribution, as indicated by the sharp deviation of observed values above a theoretical quantile of about 1.5. There is also slight skewing at the lower end of the distribution, although this appears to be within an acceptable range.

Finally, the Residuals vs. Leverage plot allows us to identify influential data points in our model. The points we’re most concerned about are those in the upper right or lower right corners, outside the red dashed Cook’s distance line. These are points that would be influential in the model, possibly distorting our estimates. If such points were present, we’d want to consider removing them in order to get more accurate estimates from our model (https://medium.com/data-distilled/residual-plots-part-4-residuals-vs-leverage-plot-14aeed009ef7). It appears as though there may be one point (labeled 2681) that lies either on or just outside this boundary, suggesting our model might be improved by removing it. Because it’s on the cusp, however, I’ve decided to leave it in.
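The check described above can also be done numerically. The following is a minimal sketch on simulated data (not the NLSY79 sample) in which one hypothetical high-leverage point is planted on purpose; it flags points whose Cook's distance exceeds 0.5, the inner contour that plot.lm() draws by default, and compares the fit with and without them.

```r
# Simulated data for illustration only; not the NLSY79 sample.
set.seed(2)
dat <- data.frame(x = rnorm(100))
dat$y <- 2 * dat$x + rnorm(100)
# Add one hypothetical high-leverage point far from the trend (row 101).
dat <- rbind(dat, data.frame(x = 10, y = -20))

fit <- lm(y ~ x, data = dat)

# Cook's distance combines each point's residual and its leverage.
cooks <- cooks.distance(fit)

# Flag points beyond the 0.5 contour of the Residuals vs. Leverage plot.
influential <- which(cooks > 0.5)
influential

# Refit without the flagged points and compare the coefficient estimates.
fit_trimmed <- lm(y ~ x, data = dat[-influential, ])
rbind(all_points = coef(fit), trimmed = coef(fit_trimmed))
```

A large shift in the coefficients after dropping a flagged point is the practical sign that the point was distorting the estimates; a negligible shift supports leaving it in.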

(c) Assessment of Statistical Significance of Key Variables

In this section, I use the anova() function to assess whether each of the variables included in my model is a statistically significant predictor of income. For each variable in turn, I compare my model from part (a) to a model from which that variable is excluded.

## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + highest_grade + marital_status + spouse_income
##   Res.Df            RSS  Df     Sum of Sq      F    Pr(>F)    
## 1   3067  9880161250758                                       
## 2   3087 10337872794679 -20 -457711543921 7.1041 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + industry + marital_status + spouse_income
##   Res.Df            RSS  Df      Sum of Sq      F    Pr(>F)    
## 1   3067  9880161250758                                        
## 2   3085 11647196744774 -18 -1767035494015 30.474 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + industry + highest_grade + spouse_income
##   Res.Df           RSS Df    Sum of Sq     F Pr(>F)  
## 1   3067 9880161250758                               
## 2   3071 9907054362983 -4 -26893112225 2.087 0.0799 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status
##   Res.Df           RSS Df   Sum of Sq      F Pr(>F)
## 1   3067 9880161250758                             
## 2   3068 9882437635679 -1 -2276384921 0.7066 0.4006

First, I apply the anova function to the industry variable. From this analysis, we see that industry is highly statistically significant, with a p-value of 6.8831159 × 10^{-20}. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that average income is the same across all industries, holding the other predictors constant. In other words, the data suggest that income varies with the industry in which one works. (Whether the income gap between men and women itself varies by industry is a question about interaction effects, which I take up in part (d).)

When I apply the anova function to the highest_grade variable, we see that it, too, is highly statistically significant, with a p-value of 5.905078 × 10^{-96}. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that average income is the same across all levels of educational attainment, holding the other predictors constant. In other words, the data suggest that income varies with level of education.

When I apply the anova function to the marital_status variable, however, we find that it is not statistically significant at the 0.05 level, having a p-value of 0.0799031. If a linear regression were appropriate for modeling this relationship, we would therefore not be able to reject the null hypothesis that average income is the same across all categories of marital status, holding the other predictors constant. In other words, we do not have evidence that income varies with a person’s marital status once the other predictors are accounted for.

Finally, when I apply the anova function to the spouse_income variable, we see that it, too, is not statistically significant, having a p-value of 0.4006285. If a linear regression were appropriate for modeling this relationship, we would not be able to reject the null hypothesis that spouse’s income has no association with a respondent’s income, holding the other predictors constant. In other words, the data do not indicate that income varies with spouse’s income once the other predictors are accounted for.
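The F-statistics in the tables above come from comparing residual sums of squares (RSS) between the full and reduced models. As a worked check, the sketch below reproduces the test for industry using only numbers copied from the first Analysis of Variance table; the variable names are my own.

```r
# Values copied from the first Analysis of Variance table above
# (full model vs. the model without industry).
rss_full    <- 9880161250758
rss_reduced <- 10337872794679
df_full     <- 3067   # residual df, full model
df_reduced  <- 3087   # residual df, reduced model

# Number of industry coefficients dropped from the full model.
df_diff <- df_reduced - df_full

# F = (increase in RSS per dropped coefficient) / (residual variance of full model).
f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

c(F = f_stat, p = p_value)
```

The same arithmetic applies to the other three tables. Note that this comparison is only valid when both models are fit to the same observations, which the matching degrees of freedom in each table suggest is the case here.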

(d) Main Effects vs. Interaction Effects

## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income + gender:industry
##   Res.Df           RSS Df    Sum of Sq      F Pr(>F)  
## 1   3067 9880161250758                                
## 2   3047 9760960253620 20 119200997138 1.8605 0.0114 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income + gender:highest_grade
##   Res.Df           RSS Df    Sum of Sq      F    Pr(>F)    
## 1   3067 9880161250758                                     
## 2   3052 9618402780284 15 261758470474 5.5372 2.852e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income + gender:marital_status
##   Res.Df           RSS Df   Sum of Sq     F  Pr(>F)  
## 1   3067 9880161250758                               
## 2   3063 9848920220902  4 31241029856 2.429 0.04573 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this study, we are exploring the question of whether there are any factors that exacerbate or mitigate the income gap between men and women. This is different from asking whether there are factors that affect income. In the latter case, we are only estimating the so-called main effects of each variable, while in the former case, we are measuring the interaction effects, i.e., the effect that emerges when two variables appear together. The specific interaction effects that we’re interested in are those combining gender and one of our other chosen variables. By looking at the p-values for the interaction terms, we can answer the question of whether the income gap differs across different levels of these key variables (http://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture11/lecture11-94842.html).

The first interaction term (gender:industry) is statistically significant, with a p-value of 0.0114029. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that the income gap is the same across all industries. In other words, the data suggest that the income gap between men and women varies with industry.

The second interaction term (gender:highest_grade) is statistically significant, with a p-value of 2.8523434 × 10^{-11}. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that the income gap is the same across all levels of educational attainment. In other words, the data suggest that the income gap between men and women varies with educational attainment.

The third interaction term (gender:marital_status) is also statistically significant, with a p-value of 0.0457257. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that the income gap is the same across all categories of marital status. In other words, the data suggest that the income gap between men and women varies with marital status.
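To illustrate what an interaction test detects, here is a minimal sketch on simulated data (not the NLSY79 sample) in which the gender gap is deliberately built to be larger among college graduates. The variable degree and all effect sizes are hypothetical.

```r
# Simulated data for illustration only; not the NLSY79 sample.
set.seed(3)
n <- 400
toy <- data.frame(
  gender = factor(rep(c("Male", "Female"), each = n / 2),
                  levels = c("Male", "Female")),
  degree = factor(rep(c("High school", "College"), times = n / 2),
                  levels = c("High school", "College"))
)
# Built-in gap: 5,000 among high school graduates, 20,000 among college graduates.
gap <- ifelse(toy$degree == "College", 20000, 5000)
toy$income <- 30000 + 25000 * (toy$degree == "College") -
  gap * (toy$gender == "Female") + rnorm(n, sd = 8000)

main_fit <- lm(income ~ gender + degree, data = toy)
int_fit  <- update(main_fit, . ~ . + gender:degree)

# F-test of the null hypothesis that the gender gap is the same at every degree level.
anova(main_fit, int_fit)

# The interaction coefficient estimates how much the gap changes for college graduates.
coef(int_fit)["genderFemale:degreeCollege"]
```

In the main-effects model, the genderFemale coefficient is a single gap averaged over both groups; only the interaction model can show that the gap differs between them.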

2. Methodology

This section details the approach I took in exploring and analyzing the data. Here, I tell the story of how I got to my main conclusions and address some of the twists and turns I encountered along the way.

(a) Missing Values

Negative values were used in the NLSY 1979 Cohort Survey to indicate various non-response scenarios, including refusal to answer (-1), don’t know (-2), invalid skip (-3), valid skip (-4), and non-interview (-5). For the purposes of this analysis, these values were replaced with NA values to preserve the integrity of statistical summaries. This approach, while imperfect, was judged to be less problematic than any of the alternative approaches, such as systematically assigning non-responses to one or more of the other response categories, which would rely on assumptions I didn’t feel adequately equipped to justify.
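The recoding step can be sketched as follows. The data frame, its column names, and its values are hypothetical stand-ins for the base data set; only the five non-response codes come from the description above.

```r
# Hypothetical extract; column names and values are made up for illustration.
nlsy <- data.frame(
  income        = c(52000, -4, 31000, -1, 0),
  highest_grade = c(12, 16, -5, 14, -2)
)

# NLSY79 non-response codes listed above.
missing_codes <- c(-1, -2, -3, -4, -5)

# Replace every non-response code with NA, column by column.
nlsy[] <- lapply(nlsy, function(col) {
  col[col %in% missing_codes] <- NA
  col
})

colSums(is.na(nlsy))             # number of recoded values per variable
mean(nlsy$income, na.rm = TRUE)  # summaries now skip the non-responses
```

Matching on the explicit list of codes, rather than on every negative number, keeps legitimate zeros intact and would not silently discard a variable that can genuinely take negative values.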

Each of these responses has potentially different significance to our analysis depending on the research question. For example, I found that male respondents were significantly more likely than female respondents to refuse to report their level of educational attainment. While it’s possible that individuals responding in this way would have had roughly the same distribution across the various categories as those who did respond, it’s perhaps more likely that the no-response bin contained a greater proportion of low-attainment individuals than the group that provided responses and was included (due, perhaps, to the embarrassment of disclosing low educational attainment or fear of negative outcomes). If this assumption were correct, then a disparity in the numbers of non-responders would be expected to bias the model of the relationship between income and educational attainment upward relative to the true relationship.

Omitted data are more likely to bias our models when they represent a large proportion of the sample responses for a question, and they are more problematic when they are non-random (as assumed in the case above) than when they are random. Such omissions lead to classical and non-classical measurement error, which introduces bias into our models and leads to inaccurate interpretations of model coefficients.

Similarly, omitted data can negatively impact the validity of the resulting analysis by introducing selection bias. As noted above, the decision not to reply to a particular question might indicate some important difference between the responding and non-responding portions of the population (e.g., mistrust of the interviewer, low self-esteem, etc.). Such non-random differences between segments of the sample population would therefore act as a confounder and compromise the internal validity of our study.

(b) Topcoded Variables

In the NLSY79 Survey data, the variable that serves as the outcome for this analysis, TOTAL INCOME FROM WAGES AND SALARY IN PAST CALENDAR YEAR (TRUNC) (2012 survey question), was topcoded, meaning we do not get to see the actual incomes for the top 2% of earners. Survey data are often topcoded before release to the public to preserve the anonymity of respondents and to prevent possibly erroneous outliers from being published (see “Top-coded,” Wikipedia). For the purposes of this analysis, I chose to retain the topcoded values as given, primarily because topcoding keeps extreme outliers out of the data. Even if these data aren’t erroneous (i.e., even if they correspond to actual respondents), the presence of outliers can skew the data and obscure more generalizable patterns and trends.

One worry with this approach is that, by excluding extreme outliers, we actually risk concealing one of the most notable trends in the relationship between gender and income, namely the lack of representation of women in the highest-paying (especially executive-level) positions. While this is indeed a weakness of the approach, I felt that this might represent a special case with a slightly different causal story than the one that applies more generally to the population.
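The effect of topcoding on summary statistics can be seen in a small simulation. The sketch below assumes one common scheme, replacing the top 2% of values with the mean of that group; this scheme and the simulated incomes are assumptions for illustration, not a description of how the NLSY79 file was actually prepared.

```r
# Simulated incomes for illustration only; not the NLSY79 sample.
set.seed(4)
true_income <- round(rlnorm(1000, meanlog = 10.5, sdlog = 0.8))

# Assumed topcoding scheme: replace the top 2% of values with the
# mean of that group.
cutoff   <- quantile(true_income, 0.98)
top      <- true_income > cutoff
released <- true_income
released[top] <- mean(true_income[top])

# In the released data, topcoded cases all share the maximum value.
is_topcoded <- released == max(released)
mean(is_topcoded)   # share of topcoded respondents

# Topcoding by the group mean preserves the overall mean but shrinks the spread.
c(mean_true = mean(true_income), mean_released = mean(released))
c(sd_true = sd(true_income), sd_released = sd(released))
```

Under this scheme the topcoded cases appear as a cluster at a single high value, and any statistic that depends on the spread of the upper tail is understated.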

(c) Dead-End Tables and Plots

In my evaluation of the second alternative hypothesis, I did several analyses examining the relationship between income and the number of jobs held by respondents. I assumed, perhaps mistakenly, that this variable could be used as a proxy for experience and would therefore correlate positively with income. However, the data did not support this assumption, suggesting that the variable may not have measured what I expected it to.

My analysis of the relationship between income and family size also turned out not to be as informative as I expected. In this case, I took family size to be a proxy for the number of children the respondent had, although the metadata did not make this explicit. In any case, I wasn’t able to discern much meaning in my analysis of that relationship, particularly in reference to the hypothesis I enlisted it to test.

Finally, I expected my analysis of the relationship between industry and income to shed more light on the wage gap between men and women than it did. While I was able to extract some interesting insights from that analysis, I felt that other interesting insights were still obscured at that level of generality. In order to get at these insights (or at least to satisfy my curiosity), I would have also liked to include a more granular analysis of the proportional representation of each gender in different professions using the Occupation variable from the base data set.

(d) Abandoned and Excluded Analyses

I investigated a number of relationships that, for various reasons, don’t appear in my findings sections. One of these is the relationship between race and income. I created several tables and plots to examine the effect that race had on the wage gap observed between men and women and found that, indeed, a woman’s race had a substantial impact on how large a difference in income she was likely to experience relative to her male counterparts. While interesting in its own right, I didn’t feel that this analysis shed much light on the central question of this study, which asked whether gender alone influenced how much a person earned in income. The primary utility of the race analysis for my purposes was to rule out the possibility that race was serving as a confounder in the relationship between gender and income, which we would have expected if male and female respondents were unequally distributed among the race categories. This, however, turned out not to be the case, so I decided to exclude the analysis from the report of my findings.

I also ran several analyses looking specifically at high wage earners (individuals earning above $50,000/yr). In the end, I felt this analysis was too narrow in its focus and detracted from the more general trends I was interested in exploring.

Finally, because my analyses of the jobs_number and family_size variables didn’t yield the depth of insight I was hoping for, I decided to exclude them from my findings as well.

(e) Final Analysis

For this study, I set out to look for evidence that might refute the null hypothesis that there is no significant difference in income between men and women and that, rather, any difference observed is completely explainable by differences in other factors between these two groups. In my final analysis, I chose to focus on the two alternative hypotheses I felt told the most comprehensive story about why men and women might earn different incomes (that is, if those differences weren’t in fact the result of discriminatory practices). In other words, I wanted to explore the possibility that the differences observed were due to factors other than the person’s gender, and to examine these factors in the broader context in which I would expect to find them.

The first of these alternative hypotheses is that men and women earn different incomes on average because of their professional qualifications and/or occupational choices. According to this account, men and women will tend to earn different incomes on account of one gender having higher qualifications in terms of either higher educational attainment or more professional experience (or both), which is rewarded with higher-paying positions, and/or on account of one gender choosing to work in industries with lower-than-average prospective salaries. All else held equal, I would expect both of these mechanisms to work in the same direction resulting in a larger wage gap between the genders than if only one of them were operative, or if they favored the genders disparately (e.g., the first favoring women and the second favoring men, or vice versa).

To test this hypothesis, I looked at the proportional representation of each gender across the different levels of educational attainment (highest_grade), number of jobs worked (jobs_number), and industry of employment (industry). I also ran a multiple linear regression on the key variables, controlling for all other factors. This allowed me to estimate the magnitude of the association between each of these variables and income, and to assess whether that association was statistically significant at the p<0.05 level. (For the final analysis, my examination of the relationship between income, gender, and jobs_number was excluded for reasons discussed elsewhere in this report.)

The second alternative hypothesis I chose to test is that the difference in income between men and women is due to family dynamics. According to this account, any differences in income observed between men and women are due to decisions individuals make cooperatively with some other member of their family unit. The most common example I would expect to see here is when two partners, whether explicitly or implicitly, adopt a strategy of distributing the essential functions of the family between themselves, with one taking greater responsibility for the family’s finances (the so-called “breadwinner” role) and the other taking greater responsibility for parenting duties and maintaining the home. I would expect this pattern to be most pervasive among respondents who are married with larger families. I would also expect this strategy to emerge in relationships in which one spouse earns a relatively high income (>$50k annually).

In order to test this hypothesis, I analyzed the proportional representation of each gender across the various categories of marital status, family size, and spouse’s income. I also ran a multiple linear regression on the key variables, controlling for all other factors. This allowed me to estimate the magnitude of the association between each of these variables and income, and to assess whether that association was statistically significant at the p<0.05 level. (For the final analysis, my examination of the relationship between income, gender, and family_size was excluded for reasons discussed elsewhere in this report.)

With several variables I examined, I was unable to perform the analyses I initially intended, mostly due to insufficient or missing data. The main consequence of this difficulty was that I had to replace the more direct analysis I intended, which would have assessed the influence of the variable directly on the male-female wage gap, with something more roundabout. The approach I ultimately settled on was to break this analysis into two separate steps. In the first step, I calculated the relationship between the variable of interest and income, without special regard to gender. In the second step, I examined the proportional representation of the genders across the various levels of the variable, watching out for any imbalances that might distort the apparent relationship between that predictor and the outcome. While not ideal, this strategy allowed me to draw inferences as to whether the wage gap between men and women may be exaggerated, or even entirely produced, by systematic differences between the two genders in terms of these other factors. In short, it allowed me to assess the likelihood that these other factors were acting as confounders in the analysis of the relationship between gender and income.
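The two-step approach described above can be sketched in base R as follows. The data are simulated (not the NLSY79 sample), and the marital_status levels and effect sizes are hypothetical.

```r
# Simulated data for illustration only; not the NLSY79 sample.
set.seed(5)
n <- 300
toy <- data.frame(
  gender = factor(sample(c("Male", "Female"), n, replace = TRUE)),
  marital_status = factor(sample(c("Never Married", "Married", "Divorced"),
                                 n, replace = TRUE))
)
toy$income <- 35000 + 8000 * (toy$marital_status == "Married") +
  rnorm(n, sd = 10000)

# Step 1: relationship between the variable and income, ignoring gender.
step1 <- aggregate(income ~ marital_status, data = toy, FUN = mean)

# Step 2: share of each gender within each level of the variable.
# margin = 1 makes each row (marital status category) sum to 1.
step2 <- prop.table(table(toy$marital_status, toy$gender), margin = 1)

step1
round(step2, 2)
```

If a level with high average income in step 1 also shows a lopsided gender split in step 2, the variable is a candidate confounder; if the splits are roughly even, it is unlikely to account for much of the gap.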

3. Findings

In this section, I provide a careful presentation of my main findings concerning the problem of income inequality between men and women.

Tabular and Graphical Summaries

(a) Alternative Hypothesis 1: Professional Qualifications and Occupational Choice

The first bar chart above shows that the highest paying industries, on average, are Finance and Insurance, Professional, Scientific, and Technical Services, Information, and Utilities while the lowest paying industries are Management, Administrative Support, and Waste Management Services, Construction, and Accommodations and Food Services.

A side-by-side comparison of average income across industries by gender shows a slight-to-substantial advantage for men across most industries. This analysis suggests that the wage gap between men and women is not due to differences in choice of industry between the two groups, insofar as men tend to outearn women independently of what industry they’re in. The few exceptions are in the areas of Real Estate and Rental and Leasing and the Armed Forces, where women slightly outearn men on average.

Even in those areas where women are more strongly represented, such as Health Care and Social Assistance and Educational Services, men still tend to earn more on average (see next section’s analysis).

The bar chart above shows that the representation gap between men and women varies across industries, with women being more strongly represented in such industries as Health Care and Social Assistance and Educational Services and men being more strongly represented in Construction and Manufacturing. If those industries for which men were more strongly represented also tended to correspond to higher salaries on average, then this disparity might partially explain the wage gap we observe between men and women. If there is no such correlation, however, then this would not be a likely explanation for the gap we observe.

Referring back to the previous section’s analysis, we find no such correlation between high-paying industries and representation. Men predominate in two out of the three lowest-paying industries noted above, while women predominate in the highest-paying industry, Finance and Insurance, are equally represented in the second-highest-paying industry, and are only slightly underrepresented in the remaining two highest-paying industries. While this analysis doesn’t eliminate the possibility that people’s choice of industry contributes to the wage gap we observe between men and women, it does somewhat weaken the case in favor of that explanation.

In the final section of my analysis of the relationship between income and industry, I compare the wage gap within each industry to further validate the results of the previous sections’ analyses.

## Warning: Removed 1 rows containing missing values (position_stack).

The bar chart above shows that the wage gap between men and women is far more pronounced in certain industries than in others, with most of these disparities favoring men. The widest disparities in earnings are in the areas of Finance and Insurance, Professional, Scientific and Technical Services, and Information, while the smallest disparities are in the areas of Construction, Real Estate and Rental and Leasing, and Armed Forces, with the latter two categories tending to favor women. In other words, in the industries in which women have an advantage, that advantage tends to be very modest, whereas the advantage is much larger in those industries that favor men. This analysis provides further evidence that occupational choice is likely not driving the difference we observe in the wages of men and women, although it is an important qualifier, for the reasons just mentioned.
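As a rough illustration of how a within-industry gap like the one charted above could be computed, the sketch below uses a small invented data frame. The column names mirror those used in this report, but the values are made up and this is not necessarily the code that produced the chart.

```r
# Hypothetical toy data: column names follow the report, values are invented.
toy <- data.frame(
  industry = rep(c("Construction", "Finance and Insurance"), each = 4),
  gender   = rep(c("Male", "Female"), times = 4),
  income   = c(40000, 38000, 42000, 39000, 90000, 50000, 80000, 60000)
)

# Mean income for every industry-by-gender cell (rows: industry, columns: gender).
cell_means <- tapply(toy$income, list(toy$industry, toy$gender), mean)

# Within-industry gap: male mean minus female mean (positive values favor men).
gap <- cell_means[, "Male"] - cell_means[, "Female"]
print(gap)
```

Sorting `gap` would then show which industries have the widest and narrowest disparities.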

Next, let’s look at the relationship between income and educational attainment (highest_grade).

## Warning: Ignoring 5662 observations
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors

## Warning: Removed 2 rows containing missing values (geom_bar).

In the boxplot provided above, you can see the generally positive effects of additional years of education on average income earned. The boxplot also allows us to see how additional years of education influence the range of incomes that become accessible to people in each class. Many of the higher income categories (e.g., above $100k/year) are reserved almost exclusively for those possessing at least a high school diploma (i.e., completed up to 12 years of education). Around the $350k mark, you can see the top-coded values of those earning significantly more than the bulk of the distribution for each class. We can therefore assume that the real average for these higher educational levels is actually somewhat higher than what is displayed, although such outcomes are rare.

The bar chart similarly shows a positive correlation between level of educational attainment and average income. With a few minor exceptions, average income tends to increase with every additional level of educational attainment, for both men and women. There are a few minor deviations from this trend among levels of grade school as well as college, but some of these differences likely fall within the margin of error for those measurements, so should not be interpreted as significant. Among the major classes of educational attainment, e.g., from grade school to a bachelor’s degree, and between different levels of higher education, the difference is much more significant.

Notably, the positive effect of educational attainment on income is much more pronounced for men than women, a pattern that holds across virtually every category of education. The sole exception is for those with a 4th grade education, though again, this difference is likely within the margin of error for this category (n = 9), and therefore should not be interpreted as significant.

The two bar charts above show the male-to-female proportional representation per level of educational attainment and male - female difference in proportional representation per level of educational attainment, respectively. While male and female respondents were represented nearly equally across all categories, the slight differences that do exist are telling. Specifically, we find that men are more strongly represented among those who completed up to some high school (9th to 12th grade) and 8 or more years of college, while women are more strongly represented among those who completed up to some college (1 to 7 years). In other words, female respondents were on average better educated than men across the entire sample.
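The two quantities plotted above can be sketched as follows. This is one reading of "proportional representation" (the share of each gender within an education level), shown on invented rows rather than the NLSY79 data.

```r
# Hypothetical toy data: seven invented respondents.
toy <- data.frame(
  highest_grade = c("12th grade", "12th grade", "12th grade",
                    "1st year college", "1st year college",
                    "1st year college", "1st year college"),
  gender = c("Male", "Male", "Female", "Male", "Female", "Female", "Female")
)

# Counts per education level (rows) and gender (columns).
counts <- table(toy$highest_grade, toy$gender)

# Share of each gender within an education level; each row sums to 1.
shares <- prop.table(counts, margin = 1)

# Male minus female difference in representation per level.
rep_diff <- shares[, "Male"] - shares[, "Female"]
print(round(rep_diff, 3))
```

A negative value means women are more strongly represented at that level.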

This analysis, like the previous one, provides evidence against the first alternative hypothesis, which proposed that women may be earning lower incomes because of lower educational attainment compared to men. In fact, what this analysis shows is that men are earning more despite having lower educational qualifications than their female counterparts, which is precisely the opposite of what this hypothesis predicted.

In the final section of my analysis of the relationship between income and professional qualifications, I compare the wage gap within each level of educational attainment to further validate the results of the previous sections’ analyses.

## Warning: Removed 2 rows containing missing values (position_stack).

The bar chart above shows that the wage gap between men and women increases as level of educational attainment increases, in favor of men. We see slight drops at irregular intervals, such as 5 years and 7 years of college, which might represent individuals who stopped short of completing a higher level degree, such as a master’s, doctorate, or professional degree. Alternatively, it may just reflect a small sample size (and therefore a larger margin of error) for these categories.

This analysis provides a first line of evidence that professional qualifications are likely not driving the difference we observe in the wages of men and women, insofar as men are benefiting more on average from the positive relationship between educational attainment and income, despite women having the stronger educational credentials on average. The last factor we’ll consider in our evaluation of the first alternative hypothesis is number of jobs, which is being used here as a proxy for professional experience.

(b) Alternative Hypothesis 2: Family Dynamics

## Warning: Ignoring 6177 observations
## Warning: Removed 5662 rows containing non-finite values (stat_ydensity).

The bar chart above shows the average income per category of marital status, separated by gender. You can see that men earn higher incomes on average than women across all categories of marital status. The difference is most pronounced among married individuals and least pronounced among those who have never been married. The second alternative hypothesis offers a possible explanation for this trend, which is that single individuals, whether male or female, are less likely to be burdened by the responsibilities of parenthood and therefore can devote more energy and attention to their careers, and perhaps even compete for more competitive high paying jobs. In contrast, married individuals are more likely both to have children and to share incomes with their partners. Both of these factors (i.e., having larger families and sharing income with their spouse) would, according to this hypothesis, lead us to expect a decrease in the wages of women relative to men, as couples shift the burdens of parenthood disproportionately onto one partner to allow the remaining partner to fill the role of “breadwinner.” The analyses that follow will help evaluate whether the data support this explanation.

The boxplot similarly shows higher median incomes for married respondents compared to all other categories, with the upper fence value reaching significantly higher than those of the other categories. The violin plot shows how the population is distributed throughout the various portions of the range for income, with the married, divorced, and widowed categories being the only ones to feature a somewhat even concentration of the population in the higher income ranges, and all other categories tapering off precipitously as incomes increase.

The bar chart above shows that men are more strongly represented among those respondents who have never been married, while women predominate in every other category, albeit by only a slight (<3%) margin. Our second alternative hypothesis proposed that the wage gap observed between men and women might be partially explained by a larger proportion of women being married relative to men. This analysis provides very weak evidence for that hypothesis, since approximately 6% more male respondents were single (never married) and women were very slightly (<.25%) more likely to be married. Likewise, a larger proportion of female respondents than male respondents were either separated or divorced. To the extent that these categories correlate positively with shared incomes and/or having children, our hypothesis would receive somewhat stronger evidence in its favor.

The bar chart above shows that there is a statistically significant difference in the incomes of men and women for the separated, divorced, and married categories, but no statistically significant difference for the never married and widowed categories. The most notable difference by a wide margin is for the married category, where the mean difference in income between men and women is $37,937.52, in favor of men.

However, since men and women are represented roughly equally within this category (refer to previous section’s analysis), this disparity does not help to explain why men tend to earn higher incomes than women.
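One way to check per-category significance of the kind described above is a two-sample test within each marital status. The sketch below assumes a Welch t-test, which may not be the method behind the chart, and it runs on simulated data with invented means and spreads.

```r
set.seed(42)  # simulated data only; none of these values come from NLSY79

# Hypothetical toy data: a large gap among married respondents, a small one otherwise.
toy <- data.frame(
  marital_status = rep(c("Married", "Never married"), each = 40),
  gender = rep(rep(c("Male", "Female"), each = 20), times = 2),
  income = c(rnorm(20, 70000, 8000), rnorm(20, 35000, 8000),
             rnorm(20, 40000, 8000), rnorm(20, 39000, 8000))
)

# Run a Welch two-sample t-test of income by gender within each category.
by_status <- split(toy, toy$marital_status)
p_values <- sapply(by_status, function(d) t.test(income ~ gender, data = d)$p.value)
print(p_values)
```

With five categories tested at once, a multiple-comparison adjustment such as `p.adjust()` may also be worth considering.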

## `geom_smooth()` using method = 'gam' and formula 'y ~ s(x, bs = "cs")'
## Warning: Removed 9192 rows containing non-finite values (stat_smooth).
## Warning: Removed 9192 rows containing missing values (geom_point).

The scatter plots above show men’s and women’s income plotted against their spouses’ income. When a smoothed curve is added to represent how income varies with spouse’s income, we see markedly different trends for men and women. For men, personal income appears to be positively correlated with their spouse’s income at the higher ranges of the distribution for spouse’s income, while for women, the relationship is almost flat throughout the full range of the distribution. In other words, women’s income stays about the same on average regardless of how much their spouses earn, while men’s income appears to drop slightly as their spouse’s income increases up to about $30,000, but then increases steadily as their spouse’s income increases above $30,000.

This analysis does not provide evidence for the proposal made by our second alternative hypothesis, namely that women’s income may be lower than men’s in part because they are strategically distributing the caretaking and “breadwinning” responsibilities with their spouses. If that were happening to a significant extent, we would expect women’s wages to decrease slightly on average as their spouse’s income increases, as some women dropped out of the work force to focus on parenting. Instead, what we find is that women’s incomes tend to increase along with their spouse’s income up to about x=$150,000 and decline only when their spouse’s income exceeds $150,000.

This interpretation is slightly more consistent with what we see happening in the case of men, however, at least for lower income couples. That is, men tend to have lower incomes as their spouse’s income increases up to about x=$40,000, possibly reflecting the effects of strategic income sharing to balance household responsibilities (although this is certainly not the only explanation for what we see). Among higher income couples, however, both partners’ incomes seem to increase together.

Regression Models

(a) Interpretation of Linear Regression Model

In this section, I fit a linear regression model to the relationship between gender and my chosen key variables and interpret the model coefficients. As noted above, several variables included in my part 1 analysis, including jobs_number and family_size, were excluded from my final model due to weak association with the main variables of interest (gender and income) and/or difficulties with interpretability. It should also be noted that the analysis of this section relies on certain assumptions that will not be evaluated until the next section (part (b)), which will determine whether a linear regression is appropriate for modeling the relationship between income and these variables, and correspondingly, whether the standard interpretation of the coefficients is valid.

The first thing to note from the output summary above is that gender is a highly statistically significant predictor of income, with a p-value of < 2e-16. Even holding industry, educational attainment, marital status, and spouse’s income constant, being female is associated with a -$38,549.58 difference in income compared to being male. Altogether, the statistically significant coefficient estimates in this model include (Intercept), genderFemale, industryConstruction, industryWholesale Trade, industryFinance and Insurance, industryProfessional, Scientific, and Technical Services, industryManagement, Administrative and Support, and Waste Management Services, industryEducational Services, industryArts, Entertainment, and Recreation, industryOther Services (Except Public Administration), highest_grade1st year college, highest_grade2nd year college, highest_grade3rd year college, highest_grade4th year college, highest_grade5th year college, highest_grade6th year college, highest_grade7th year college, highest_grade8th year college or more. Below I provide an interpretation of a select few of these significant variables.

For the interpretations that follow, the baseline for comparison is a male who has never been married, has a high school education (has completed 12 years of education), works in the area of health care and social assistance, and has a spouse with an income of $0. This is, of course, merely a hypothetical scenario and doesn’t necessarily (or actually) represent any individual from our sample population. For ease of interpretation, all subsequent mentions of “holding all other variables constant” should be understood to connote this particular collection of features, save only for the facts that (a) the individual being compared against this baseline is female (and therefore carries a starting salary $38,549.58 lower than the male baseline) and (b) the individual differs in the one additional respect specified (i.e., that for which the coefficient is being interpreted).

Referring to the output summary above, we see that working in Educational Services is associated with a -$59,430.96 difference in income compared to working in Health Care and Social Assistance (the baseline industry), holding all other variables constant. Working in Finance and Insurance is associated with a -$16,102.61 difference in income compared to the same baseline industry, holding all other variables constant. Working in Professional, Scientific and Technical Services is associated with a -$23,824.75 difference in income compared to the baseline industry, holding all other variables constant.

Similarly, with regard to level of educational attainment, we see that having completed 4 years of college is associated with a $3,141.26 difference in income compared to having completed 12 years of education (the baseline level), holding all other variables constant, while having completed 8 or more years of college is associated with a $59,643.50 difference in income compared to that same baseline, holding all other variables constant.
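To make the role of the baseline concrete, the sketch below fits a small additive model on invented, noise-free data and sets the reference levels explicitly with relevel(). The incomes and effect sizes are hypothetical and chosen so the coefficients can be read off exactly. Each factor coefficient is a difference from that factor's own reference level, not from the other factors' baselines.

```r
# Hypothetical toy data: every gender-by-industry combination, repeated twice.
toy <- expand.grid(
  gender   = c("Male", "Female"),
  industry = c("Health Care and Social Assistance", "Finance and Insurance"),
  stringsAsFactors = TRUE
)
toy <- toy[rep(1:4, each = 2), ]

# Invented additive incomes: 50000 baseline, -10000 if female, +20000 in finance.
toy$income <- 50000 -
  10000 * (toy$gender == "Female") +
  20000 * (toy$industry == "Finance and Insurance")

# Make the reference (baseline) levels explicit.
toy$gender   <- relevel(toy$gender, ref = "Male")
toy$industry <- relevel(toy$industry, ref = "Health Care and Social Assistance")

fit <- lm(income ~ gender + industry, data = toy)

# (Intercept): male in the baseline industry.
# genderFemale: difference from male, same industry.
# industryFinance and Insurance: difference from the baseline industry, same gender.
print(coef(fit))
```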

The R-squared for this model is 0.2822, meaning that approximately 28.22% of the variability in income is explained by knowing the values of our predictors.
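For reference, R-squared is one minus the ratio of the residual sum of squares to the total sum of squares. The sketch below checks that identity on a tiny invented data set; the numbers are illustrative only.

```r
# Hypothetical toy data, unrelated to the NLSY79 sample.
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)

fit <- lm(y ~ x)

rss <- sum(residuals(fit)^2)   # variation left unexplained by the model
tss <- sum((y - mean(y))^2)    # total variation in the response

r_squared <- 1 - rss / tss
print(r_squared)
print(summary(fit)$r.squared)  # same value as reported by summary()
```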

Next, I turn to an evaluation of the linear regression model to better assess the validity of the interpretations provided in this section.

(b) Evaluation of Linear Regression Model

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

In this section, I discuss whether the standard diagnostic plots indicate issues with a linear regression model for gender and my chosen key variables. The issues I’m looking for include such things as trends in residuals, variance issues, outliers, etc.

First of all, when we plot the distribution of residuals for this relationship, we see that both the linearity assumption (i.e., that the residuals look like random scatter around the zero line and there is no evidence of structure or pattern in the residuals) and the homoscedasticity assumption (i.e., that there is equal variance in the deviance of each y value from the fitted line) are violated, meaning the relationship cannot be appropriately modeled by a linear regression. Instead, the data have a tendency to concentrate below where the fitted line would predict them to appear, indicating that the mean of the observed values is consistently lower than what we would expect if the relationship were linear.

The assumption of homoscedasticity is somewhat more difficult to assess, but there appear to be slight variations in the deviance of y values at certain x values, specifically below x values of about 25,000. There also seems to be significant left to right fanning when we focus just on where the data are most densely concentrated, although this fanning is clearly constrained by the rigid bottom limit and, to a somewhat lesser degree, the upper limit as well (likely a consequence of our topcoded values). In light of this analysis, we can conclude that any linear regression model of this relationship is going to be significantly limited in its predictive power.

The Scale-Location plot provides a more fine-tuned tool for assessing the assumption of homoscedasticity, i.e., equal variance in the deviance of each y value from the fitted line. If this assumption were upheld, the red line running through the points would be approximately flat in the horizontal direction. However, that’s not what we see in our plot, indicating that we do not have equal variance in the deviation of our y values from the fitted line across all values of x. This analysis further supports our conclusion from above that a linear model of this relationship is not appropriate.

Next, we consider the Q-Q plot. Q-Q plots take the sample data, sort them in ascending order, and then plot them against quantiles calculated from a theoretical distribution (https://data.library.virginia.edu/understanding-q-q-plots/). The superimposed line represents where the data would be expected to fall if their underlying distribution were normal. The Q-Q plot above shows that the sample data do not conform to a normal distribution, as indicated by the sharp deviation of observed values above the 1.5 quantile values. There is also slight skewing at the lower end of the distribution, although this appears to be within an acceptable range.

Finally, the Residuals vs. Leverage plot allows us to identify influential data points in our model. The points we’re most concerned about are values in the upper right or lower right corners, which are outside the red dashed Cook’s distance line. These are points that would be influential in the model, possibly distorting our estimates. If such points were present, we’d want to consider removing them in order to get more accurate estimates from our model (https://medium.com/data-distilled/residual-plots-part-4-residuals-vs-leverage-plot-14aeed009ef7). It appears as though there may be one point (labeled 2681) that lies either on or just outside this boundary, suggesting our model might be improved by removing it. Because it’s on the cusp, however, I’ve decided to leave it.
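The four diagnostic plots discussed above are the defaults produced by calling plot() on a fitted lm object, and the influence measure behind the dashed contour is available directly from cooks.distance(). The sketch below shows both on simulated data with one deliberately planted high-leverage point. The data are invented and the planted point has nothing to do with observation 2681.

```r
set.seed(7)  # simulated data only

# Twenty points on a clear trend plus one planted point far off in x and off-trend in y.
x <- c(1:20, 60)
y <- c(2 * (1:20) + rnorm(20), 10)

fit <- lm(y ~ x)

# Cook's distance combines residual size and leverage for each observation.
cd <- cooks.distance(fit)
most_influential <- unname(which.max(cd))
print(most_influential)

# Draw the standard diagnostic plots (Residuals vs Fitted, Q-Q,
# Scale-Location, Residuals vs Leverage) when running interactively.
if (interactive()) {
  par(mfrow = c(2, 2))
  plot(fit)
}
```

Refitting with and without the flagged row and comparing coefficients is a simple way to judge whether a borderline point matters.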

(c) Assessment of Statistical Significance of Key Variables

In this section, I use the anova() function to assess whether each of the variables included in my model is a statistically significant predictor of income by comparing my model from part (a) to a model where each of these variables is excluded.

## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + highest_grade + marital_status + spouse_income
##   Res.Df            RSS  Df     Sum of Sq      F    Pr(>F)    
## 1   3067  9880161250758                                       
## 2   3087 10337872794679 -20 -457711543921 7.1041 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + industry + marital_status + spouse_income
##   Res.Df            RSS  Df      Sum of Sq      F    Pr(>F)    
## 1   3067  9880161250758                                        
## 2   3085 11647196744774 -18 -1767035494015 30.474 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + industry + highest_grade + spouse_income
##   Res.Df           RSS Df    Sum of Sq     F Pr(>F)  
## 1   3067 9880161250758                               
## 2   3071 9907054362983 -4 -26893112225 2.087 0.0799 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status
##   Res.Df           RSS Df   Sum of Sq      F Pr(>F)
## 1   3067 9880161250758                             
## 2   3068 9882437635679 -1 -2276384921 0.7066 0.4006

First, I apply the anova() function to the industry variable. From this analysis, we see that industry is highly statistically significant, with a p-value of 6.88e-20. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that average income is the same across all industries, holding the other predictors constant. In other words, the data suggest that income varies with the industry in which one works. Note that this comparison tests only the main effect of industry; whether the income gap between men and women varies across industries is a question about the interaction with gender, which is taken up in part (d).

When I apply the anova() function to the highest_grade variable, we see that it, too, is highly statistically significant, with a p-value of 5.91e-96. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that average income is the same across all levels of educational attainment, holding the other predictors constant. In other words, the data suggest that income varies with level of education.

When I apply the anova() function to the marital_status variable, however, we find that it is not statistically significant at the 0.05 level, having a p-value of 0.0799. If a linear regression were appropriate for modeling this relationship, we would therefore not be able to reject the null hypothesis that average income is the same across all categories of marital status, holding the other predictors constant. In other words, once the other predictors are accounted for, we do not have evidence that income varies with a person’s marital status.

Finally, when I apply the anova() function to the spouse_income variable, we see that it, too, is not statistically significant, having a p-value of 0.4006. If a linear regression were appropriate for modeling this relationship, we would not be able to reject the null hypothesis that the coefficient on spouse’s income is zero, holding the other predictors constant. In other words, the data do not indicate that income varies with spouse’s income once the other predictors are accounted for.
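The F statistics in the tables above can be reproduced by hand from the printed residual sums of squares and degrees of freedom. The sketch below does this for the industry comparison, using only the numbers shown in the first table; the variable names are my own.

```r
# Values copied from the first Analysis of Variance Table (industry comparison).
rss_full    <- 9880161250758    # Model 1: includes industry
df_full     <- 3067
rss_reduced <- 10337872794679   # Model 2: industry excluded
df_reduced  <- 3087

# Extra sum of squares explained by industry, per degree of freedom used,
# relative to the residual variance of the full model.
df_diff <- df_reduced - df_full
f_stat  <- ((rss_reduced - rss_full) / df_diff) / (rss_full / df_full)

# Upper-tail probability under the F distribution.
p_value <- pf(f_stat, df_diff, df_full, lower.tail = FALSE)

print(f_stat)   # should agree with the F column of the table (7.1041)
print(p_value)
```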

(d) Main Effects vs. Interaction Effects

## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income + gender:industry
##   Res.Df           RSS Df    Sum of Sq      F Pr(>F)  
## 1   3067 9880161250758                                
## 2   3047 9760960253620 20 119200997138 1.8605 0.0114 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income + gender:highest_grade
##   Res.Df           RSS Df    Sum of Sq      F    Pr(>F)    
## 1   3067 9880161250758                                     
## 2   3052 9618402780284 15 261758470474 5.5372 2.852e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## Analysis of Variance Table
## 
## Model 1: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income
## Model 2: income ~ gender + industry + highest_grade + marital_status + 
##     spouse_income + gender:marital_status
##   Res.Df           RSS Df   Sum of Sq     F  Pr(>F)  
## 1   3067 9880161250758                               
## 2   3063 9848920220902  4 31241029856 2.429 0.04573 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

In this study, we are exploring the question of whether there are any factors that exacerbate or mitigate the income gap between men and women. This is different from asking whether there are factors that affect income. In the latter case, we are only estimating the so-called main effects of each variable, while in the former case, we are measuring the interaction effects, i.e., the effect that emerges when two variables appear together. The specific interaction effects that we’re interested in are those combining gender and one of our other chosen variables. By looking at the p-value for each interaction term, we can answer the question of whether the income gap differs across different levels of these key variables (http://www.andrew.cmu.edu/user/achoulde/94842/lectures/lecture11/lecture11-94842.html).

The p-value for the first interaction variable (industry * gender) is statistically significant at a value of 0.0114. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that the income gap is the same across all industries. In other words, the data suggest that the income gap between men and women varies with industry.

The p-value for the second interaction variable (highest_grade * gender) is statistically significant at a value of 2.85e-11. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that the income gap is the same across all levels of educational attainment. In other words, the data suggest that the income gap between men and women varies with educational attainment.

The p-value for the third interaction variable (marital_status * gender) is also statistically significant at a value of 0.0457. If a linear regression were appropriate for modeling this relationship, we would therefore be able to reject the null hypothesis that the income gap is the same across all categories of marital status. In other words, the data suggest that the income gap between men and women varies with marital status.
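The comparisons in this part follow the pattern sketched below: fit an additive model, fit a second model that adds a gender interaction, and compare the two with anova(). The data are simulated so that the gap differs by industry by construction, and the two industry labels ("A" and "B") are placeholders rather than categories from the report.

```r
set.seed(123)  # simulated data only
n <- 200

# Hypothetical toy data: each gender has 50 people in each of two industries.
toy <- data.frame(
  gender   = factor(rep(c("Male", "Female"), each = n / 2),
                    levels = c("Male", "Female")),
  industry = factor(rep(c("A", "B"), times = n / 2))
)

# By construction, the female shortfall is 5000 in industry A and 30000 in B.
gap <- ifelse(toy$industry == "B", 30000, 5000)
toy$income <- 60000 - gap * (toy$gender == "Female") + rnorm(n, sd = 5000)

# Main effects only versus main effects plus the gender:industry interaction.
additive <- lm(income ~ gender + industry, data = toy)
interact <- lm(income ~ gender + industry + gender:industry, data = toy)

comparison <- anova(additive, interact)
print(comparison)

# p-value for the added interaction term.
p_interaction <- comparison[["Pr(>F)"]][2]
```

A small p-value here says the gap is not constant across industries; it does not by itself say in which industries the gap is larger.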

4. Discussion

In this section, I summarize my main conclusions and discuss potential limitations of my analysis and findings, beginning with potential confounders.

(a) Potential Confounders

As noted throughout this report, there are a number of potential confounders that I may not have accounted for in my analysis and that may limit the validity of my final analysis. The first of these, which I’ve already described in some detail, is that represented by the missing values. If these non-responses aren’t random, i.e., if they aren’t balanced across the various segments of our sample population, then they may introduce bias into our model by obscuring important differences in the non-responding portions of the population. These differences may have an impact both on the assignment of treatment (for those predictors that individuals have control over) as well as the outcome of interest (income), thereby confounding our results.
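A simple first check on whether non-response is balanced across groups is to compare the share of missing income values by group. The sketch below does this by gender on a handful of invented rows; it is meant only to show the mechanics, not to describe the actual pattern of missingness in the data.

```r
# Hypothetical toy data: NA marks a missing income response.
toy <- data.frame(
  gender = c("Male", "Male", "Male", "Male",
             "Female", "Female", "Female", "Female"),
  income = c(50000, NA, 42000, 61000, NA, NA, 30000, NA)
)

# Proportion of missing income values within each gender.
missing_rate <- tapply(is.na(toy$income), toy$gender, mean)
print(missing_rate)
```

Large differences in these rates across groups would be a warning sign that the missing values are not random.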

Another obvious source of possible confounders in my model is the set of variables from the original data set that I chose to omit. While omitting these variables reduces the risk of collinearity and yields a simpler, lower variance model, it also excludes many factors we might like to control for. Specifically in relation to my second alternative hypothesis, which speculated about influences emanating from one’s family unit, it may have been helpful to include those variables that related to respondents’ attitudes about gender, as well as those variables that would intuitively influence those attitudes, such as one’s religious beliefs and perhaps region (e.g., whether the respondent is from the more conservative south, or a rural rather than urban environment, etc.). Such factors could easily operate as confounders in the context of that analysis.

Similarly, in the context of my first alternative hypothesis, which proposed that the wage gap between men and women might be explained by systematic differences in men’s and women’s professional qualifications and/or occupational choices, it might have been informative to include the variable coding respondents’ occupations (occupation) rather than simply industry (industry). Almost all industries feature a prominent hierarchical structure that may separate male and female workers more neatly than the different industries themselves. Likewise, one’s position within this hierarchy may be more strongly predictive of one’s income than one’s choice of industry itself, in which case industry would be largely operating as a red herring in our analysis. Other variables I didn’t look at, but which could be acting as confounders in relation to this analysis, are those coding whether the respondent had a criminal record or a history of drug abuse, both of which would be expected to correlate negatively with one’s job prospects and income. Based on what we know about the relationship between gender and criminal behavior (i.e., that men are much more likely to have a criminal record than women), however, I would not expect the inclusion of these variables to mitigate the wage gap we observe between men and women. If anything, the failure to control for them may actually be suppressing the true extent of men’s advantage over women in income.

Next, I address the issue of plausibility of the models presented in my final analysis.

(b) Plausibility of Models

All in all, the models presented in my final analysis told a consistent story of wage discrimination against women. The wage gap observed between men and women persisted across all other parameters we examined, including choice of industry, level of education, number of jobs, marital status, family size, and spouse’s income, providing strong evidence that it is, in fact, gender that is responsible for the differences we observe. At least to this extent, I believe my models to be telling an accurate story, and I find the results perfectly plausible.

The perhaps more interesting question is whether these other factors analyzed may be serving to mitigate or exacerbate the effect of gender on income. Unfortunately, this is where things become a lot more murky. First of all, the diagnostic plots indicate that a linear regression is not appropriate for modeling the relationship between income and my other chosen variables. This effectively undermines the plausibility of my linear models off the bat.

Unfortunately, the tabular and graphical summaries are much less useful for drawing the sorts of conclusions I’m interested in drawing, and are much more open to alternative interpretations. This is not to repudiate their plausibility, which I feel is reasonably strong; but it does mean I’m more limited in what I’m able to conclude on the basis of those models alone. I think the strongest case is made by taking all the tabular and graphical summaries together, and finding the explanation that consistently fits them all. However, I don’t think I’ve accumulated a sufficient number of consistent models to feel like I’m able to rule out the many other alternative narratives that might fit them equally well. In the final analysis, then, while I feel the tabular and graphical summaries are certainly plausible, I don’t think they constitute decisive evidence in support of any particular conclusion.

Before closing, I’ll consider the question of how much confidence I have in my analysis, i.e., whether I believe my conclusions and would feel confident presenting them to policy makers.

(c) Confidence in Analysis

While I feel reasonably confident in the broad strokes of my analysis (e.g., that there is a wage gap between men and women, that it is not fully explained by the influence of other, non-gender related factors, and that there are significant interaction effects between gender and the other factors examined), I am significantly less confident in my estimates of specific coefficients, especially in light of what was learned from my diagnostic plots concerning the appropriateness of a linear regression to model the relationship between income and my other chosen variables. If my assessment of those plots is accurate, then we cannot rely on the standard interpretation of model coefficients or any analysis that is based on them.

In my assessment, the main utility of this analysis is to illuminate those patterns and trends that most directly bear on the question of whether there is a gender wage gap, to help us readjust or fine-tune our expectations according to what those patterns reveal, and to help us ask better, more well-targeted questions moving forward. Where I would encourage policy-makers to direct their immediate attention and effort is in identifying a more appropriate model for mapping the relationship between gender and income given what we now know about the limitations of a straightforward linear model. Doing so should allow us to get beyond the sort of loose speculation I’ve engaged in throughout this report and closer to the sorts of causal claims we’re ultimately interested in making.